3 Reading in a Paper

The first step to using Metacheck is to read in a manuscript. Currently, Metacheck relies on GROBID, which works with PDF files. If you have a manuscript in a different format, save a copy as a PDF. You have two options to convert a PDF to the GROBID .xml format. First, you can use a local version of GROBID. We explain this below, but it requires more technical knowledge and sufficient space on your hard drive. Alternatively, you can use an online server. This is much simpler, and you can use a GROBID server at Eindhoven University of Technology that is GDPR compliant and does not store any data. This is the approach we use by default in this manual. Nevertheless, you might prefer not to send your manuscript to an external server, or you might not be allowed to. We have developed Metacheck so that it can always work locally, without an internet connection, which makes it available in these more restrictive situations as well.

The function convert() reads PDF files and saves them in JSON format. When you use an online server, this requires an internet connection and takes a few seconds per paper. Conversion needs to be done only once, and the converted file can be saved for later use. If you don’t supply an api_url, Metacheck will go through a list of active servers and choose the first one available. By default, this is the GDPR-compliant GROBID server at Eindhoven University of Technology.

You can also supplement the reference section created by GROBID with information retrieved from CrossRef by setting crossref_lookup = TRUE. This can increase the accuracy of the information in the reference list. Retrieving this information also requires an internet connection, and can take some time. You can always add this information later with the add_bibmatch() function.

pdf_file <- demofile("pdf")
json_file <- convert(file_path = pdf_file,
                     save_path = "converted",
                     crossref_lookup = TRUE)
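
Because conversion only needs to happen once, a script that is re-run often can skip papers that already have a converted file. The sketch below shows that general pattern in base R. Both convert_once() and fake_converter() are hypothetical names introduced here for illustration and are not part of Metacheck; in practice you would pass a function that calls convert() and check for the file it actually writes.

```r
# Hypothetical helper (not part of Metacheck): run `converter` only
# when the expected output file does not exist yet.
convert_once <- function(out_file, converter) {
  if (!file.exists(out_file)) {
    converter(out_file)
  }
  out_file
}

# Stand-in for a real conversion: writes a placeholder file and
# counts how often it was called.
calls <- 0
fake_converter <- function(out_file) {
  calls <<- calls + 1
  writeLines("{}", out_file)
}

out <- file.path(tempdir(), "demo_paper.json")
unlink(out)                          # start from a clean state
convert_once(out, fake_converter)    # runs the conversion
convert_once(out, fake_converter)    # skipped: the file already exists
calls
```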

3.1 Local conversion through GROBID in Docker

You can set up your own local GROBID server following instructions from https://grobid.readthedocs.io/. The easiest way is to use Docker. Run in your system terminal, the following command downloads the GROBID 0.9.0 image (if it is not already present) and starts the server on port 8070.

docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.9.0

If you set api_url to the local address http://localhost:8070 (as illustrated below), convert() will use the local GROBID server. Metacheck also uses a local GROBID server by default if it detects one automatically.

json_file <- convert(file_path = pdf_file,
                     save_path = "converted",
                     method = "grobid",
                     api_url = "http://localhost:8070")
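
Before converting against a local server, it can help to confirm that the container is actually up. The sketch below queries GROBID's /api/isalive endpoint with base R. The helper name grobid_is_alive() is hypothetical and not part of Metacheck, and the default address assumes the Docker command above was used unchanged.

```r
# Hypothetical helper (not part of Metacheck): TRUE if a GROBID server
# answers at `api_url`, FALSE if it cannot be reached.
grobid_is_alive <- function(api_url = "http://localhost:8070", timeout = 5) {
  old <- options(timeout = timeout)          # limit how long we wait
  on.exit(options(old), add = TRUE)

  con <- url(paste0(api_url, "/api/isalive"))
  on.exit(try(close(con), silent = TRUE), add = TRUE)

  # Any connection problem is turned into NA instead of an error.
  reply <- tryCatch(
    suppressWarnings(readLines(con, n = 1, warn = FALSE)),
    error = function(e) NA_character_
  )
  isTRUE(trimws(reply) == "true")
}

grobid_is_alive()
```

If this returns FALSE, check that Docker is running and that port 8070 is not blocked or already in use.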

3.2 Load from JSON or XML

The function read() reads in papers created by convert(). It accepts both the bibr JSON files and the GROBID XML (TEI) files that conversion produces, so you can use whichever you have.

# read a converted JSON file
paper <- read(json_file)

# read a GROBID XML file directly
paper_xml <- read(demofile("xml"))

3.3 Batch Processing

The functions convert() and read() also work on a folder of files, or a vector of paths, returning a list of JSON file paths or paper objects, respectively. Batch processing is especially useful for scientists who want to examine a large number of manuscripts. For example, a metascientist might be interested in examining the uptake of open science practices over time, and run Metacheck across all articles published in a scientific journal. As part of Metacheck, we have created a paper repository of more than 15,000 open access articles that you can read in through the function papers_load(). See Using the Paper Database Corpora for how to load the available corpora.

When a folder contains both a .json and a .xml file for the same paper, read() uses the JSON and skips the duplicate XML.
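
The rule above can be pictured with a few lines of base R. This is only an illustration of the described behaviour on made-up file names, not Metacheck's actual implementation.

```r
# Illustration only: keep one file per paper, preferring .json over .xml.
files <- c("papers/a.json", "papers/a.xml", "papers/b.xml", "papers/c.json")

stem    <- tools::file_path_sans_ext(basename(files))
is_json <- tolower(tools::file_ext(files)) == "json"

# Sort so that the JSON file comes first within each paper,
# then drop every later file with the same stem.
ord  <- order(stem, !is_json)
keep <- files[ord][!duplicated(stem[ord])]
keep
```

Here a.xml is dropped because a.json exists, while b.xml is kept because it is the only file for that paper.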

3.4 The paper object

read() returns a paper object: a structured representation of the manuscript, stored as a list of tables. Everything Metacheck does — searching text, running modules, checking references — operates on this object, so it is worth understanding what is inside it and how to get at each piece.

A paper object has the class scivrs_paper, and you access its parts with the $ operator. Here are all the components:

class(paper)

#> [1] "scivrs_paper" "list"

names(paper)

#>  [1] "paper_id"  "info"      "author"    "text"      "section"   "url"      
#>  [7] "bib"       "xref"      "figure"    "table"     "eq"        "bib_match"

The components are summarised below; the rest of this section shows each one. Most are data frames, so you can work with them using base R or the tidyverse.

Component	What it holds
`paper_id`	A single string identifying the paper
`info`	One row of paper-level metadata (title, DOI, type, …)
`author`	One row per author
`text`	One row per sentence
`section`	One row per heading/section
`url`	Links found in the text
`bib`	The reference list
`xref`	Cross-references linking sentences to bib entries, tables, and figures
`figure`	Figures
`table`	Tables
`eq`	Extracted statistics and equations
`bib_match`	Reference metadata matched from CrossRef and other services

paper_id

A single string identifying the paper (the user-supplied id, the DOI, or the filename):

paper$paper_id

#> [1] "to_err_is_human"

info

A one-row data frame of paper-level metadata. Use t() to read it as a column:

t(paper$info) |>
  knitr::kable(col.names = "value")

	value
title	To Err is Human: An Empirical Investigation
keywords	NA
doi	10.32614/10.5281/zenodo.2669586
file_hash	62ede2964b174f6d
input_format	grobid 0.9.0
file_name	/Users/debruine/rproj/scienceverse/metacheck/inst/demos/to_err_is_human.xml
bibr_version	10.0
paper_type	unknown
paper_type_confidence	0
oecd_l1	NA
oecd_l2	NA
oecd_confidence	NA

Pull out a single field directly:

paper$info$title

#> [1] "To Err is Human: An Empirical Investigation"

paper$info$doi

#> [1] "10.32614/10.5281/zenodo.2669586"

author

One row per author, in paper order:

paper$author[, c("given", "family", "affiliation", "corresponding")] |>
  knitr::kable()

given	family	corresponding
Daniel	Lakens	FALSE
Lisa	Debruine	FALSE
Jakub	Werner	FALSE

text

One row per sentence (excluding the title, headings, and references). This is the table that text_search() searches.

nrow(paper$text)

#> [1] 37

head(paper$text[, c("text", "section_id", "paragraph_id")]) |>
  knitr::kable()

text	section_id	paragraph_id
This paper demonstrates some good and poor practices for use with the {metacheck} R package.	0	1
All data are simulated.	0	1
The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations.	0	1
Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional.	1	2
From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020).	1	2
Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections.	1	2

section

One row per heading. section_type is Metacheck’s best guess at the kind of section, which modules use to focus on (say) only the results:

paper$section[, c("section_id", "header", "section_type")] |>
  knitr::kable()

section_id	header	section_type
0	Abstract	abstract
1	[div-01]	intro
2	Method	method
3	Procedure	method
4	Discussion	discussion
5	Data Availability	availability
6	Author Notes	contribution
7	Power Analysis	annex
8	Results	annex
9	Figure 2 :	figure
10	Table 1 :	table
11	Figure 1 :	figure
12		foot
13	References	references
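
Because text and section share the section_id column, you can combine them to restrict sentences to one kind of section. The sketch below uses small mock data frames that stand in for paper$text and paper$section; the values are abbreviated from the demo paper and only the columns shown in this chapter are assumed.

```r
# Mock stand-ins for paper$text and paper$section (abbreviated values).
text <- data.frame(
  text = c("This paper demonstrates some good and poor practices.",
           "All data are simulated.",
           "Human error is a symptom of a poor design."),
  section_id = c(0, 0, 1)
)
section <- data.frame(
  section_id   = c(0, 1, 2),
  header       = c("Abstract", "[div-01]", "Method"),
  section_type = c("abstract", "intro", "method")
)

# Attach the section type to every sentence, then keep the abstract.
text_sec <- merge(text, section, by = "section_id")
abstract <- text_sec[text_sec$section_type == "abstract", "text"]
abstract
```

With a real paper object, the same two lines should work after replacing text and section with paper$text and paper$section.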

url

Every external link found in the text, with the sentence (text_id) it appeared in:

paper$url |>
  knitr::kable()

href	link_text	text_id
https://osf.io/48ncu	NA	9
https://aspredicted.org/by8i8v.pdf	NA	9
https://github.com/Lak-	NA	20
https://researchbox.org/4377	NA	20
https://osf.io/5tbm9	NA	21
https://osf.io/629bx	NA	21
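
Since url is an ordinary data frame, grouping or filtering links by host takes only a few lines. The sketch below uses a mock table with some of the links shown above; the host column is added here for illustration and is not something Metacheck provides.

```r
# Mock stand-in for paper$url with two of the columns shown above.
url <- data.frame(
  href = c("https://osf.io/48ncu",
           "https://aspredicted.org/by8i8v.pdf",
           "https://researchbox.org/4377",
           "https://osf.io/5tbm9"),
  text_id = c(9, 9, 20, 21)
)

# Extract the host name from each link and count links per host.
url$host <- sub("^https?://([^/]+).*$", "\\1", url$href)
table(url$host)

# Keep only the OSF links.
osf <- url[url$host == "osf.io", ]
osf$href
```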

bib

The reference list, parsed into columns:

paper$bib[, c("bib_id", "authors", "year", "title")] |>
  knitr::kable()

bib_id	authors	year	title
0	Debruine, Lisa	2025	Faux: Simulation for Factorial Designs
1	Eagly, Alice H; Wood, Wendy	1999	The Origins of Sex Differences in Human Behavior: Evolved Dispositions Versus Social Roles
2	Gino, Francesca; Wiltermuth, Scott S	2014	Evil Genius? How Dishonesty Can Lead to Greater Creativity
3	Lakens, Daniël	2018	Equivalence Testing for Psychological Research
4	Smith, F	2021	Human Error Is a Symptom of a Poor Design

xref

Cross-references that link a sentence (text_id) to an item it cites — a bib entry, table, or figure (xref_type):

paper$xref |>
  knitr::kable()

xref_id	xref_type	contents	text_id
2	bibr	(Gino and Wiltermuth 2014)	4
0	foot	foot_0	8
0	figure	2	17
NA	table	1	25
1	figure	1	25
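
One use of xref is to look up which reference an in-text citation points to. The sketch below assumes that, for rows with xref_type "bibr", xref_id holds the bib_id of the cited reference; the demo output above is consistent with this, but treat it as an assumption. The data frames are mock stand-ins for paper$xref and paper$bib.

```r
# Mock stand-ins for paper$xref and paper$bib (subset of the rows above).
xref <- data.frame(
  xref_id   = c(2, 0, 0),
  xref_type = c("bibr", "foot", "figure"),
  contents  = c("(Gino and Wiltermuth 2014)", "foot_0", "2"),
  text_id   = c(4, 8, 17)
)
bib <- data.frame(
  bib_id = c(0, 2, 4),
  title  = c("Faux: Simulation for Factorial Designs",
             "Evil Genius? How Dishonesty Can Lead to Greater Creativity",
             "Human Error Is a Symptom of a Poor Design")
)

# Keep in-text citations only, then look up the reference for each one.
# Assumption: xref_id of a "bibr" row equals the bib_id it cites.
cites <- xref[xref$xref_type == "bibr", ]
cited <- merge(cites, bib, by.x = "xref_id", by.y = "bib_id")
cited[, c("text_id", "contents", "title")]

# References in the list that no sentence links to.
uncited <- bib[!bib$bib_id %in% cites$xref_id, "title"]
uncited
```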

figure and table

Figures and tables each get a row, linked to the section table by section_id:

paper$figure[, c("figure_id", "section_id", "page_number")] |>
  knitr::kable()

figure_id	section_id	page_number
1	9	NA
2	11	NA

paper$table[, c("table_id", "section_id", "page_number")] |>
  knitr::kable()

table_id	section_id	page_number
1	10	NA

eq

Statistics and equations extracted from the text, broken into a left-hand side, degrees of freedom, comparator, and right-hand side. Many statistics modules read from this table rather than re-scanning the text:

head(paper$eq) |>
  knitr::kable()

text_id	grp_id	lhs	df	comp	rhs
15	1	M	NA	=	9.12
15	1	M	NA	=	10.9
15	1	t	(97.7)	=	2.9
15	1	p	NA	=	0.005
15	1	d	NA	=	0.59
16	2	M	NA	=	5.06
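
As a small example of working with this table yourself, the sketch below pulls out the reported p-values and keeps those below .05. It uses a mock data frame with the rows shown above. Whether rhs is stored as text or as a number is an assumption, so the code converts it with as.numeric(), which works in either case.

```r
# Mock stand-in for paper$eq, using the rows shown above.
eq <- data.frame(
  text_id = c(15, 15, 15, 15, 15, 16),
  lhs     = c("M", "M", "t", "p", "d", "M"),
  df      = c(NA, NA, "(97.7)", NA, NA, NA),
  comp    = c("=", "=", "=", "=", "=", "="),
  rhs     = c("9.12", "10.9", "2.9", "0.005", "0.59", "5.06")
)

# Keep the rows that report a p-value and make the value numeric.
p_vals <- eq[eq$lhs == "p", ]
p_vals$rhs <- as.numeric(p_vals$rhs)

# p-values reported as equal to or less than a value below .05.
sig <- p_vals[p_vals$comp %in% c("=", "<") & p_vals$rhs < 0.05, ]
sig[, c("text_id", "lhs", "comp", "rhs")]
```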

bib_match

When you convert with crossref_lookup = TRUE (or run a reference module), Metacheck matches each reference against services like CrossRef and stores the retrieved metadata here, keyed by bib_id:

paper$bib_match[, c("bib_id", "service", "doi", "title")] |>
  knitr::kable()

bib_id	service	doi	title
1	crossref	10.1037/0003-066x.54.6.408	The origins of sex differences in human behavior: Evolved dispositions versus social roles.
2	crossref	10.1177/0956797614520714	Retracted: Evil Genius? How Dishonesty Can Lead to Greater Creativity
3	crossref	10.1177/2515245918770963	Equivalence Testing for Psychological Research: A Tutorial
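
Because bib_match is keyed by bib_id, it can be compared with bib to see which references found no match, or scanned for titles that signal a problem. The sketch below uses mock stand-ins for paper$bib and paper$bib_match built from the values shown in this chapter. It is a rough illustration in base R, not a replacement for Metacheck's reference modules.

```r
# Mock stand-ins for paper$bib and paper$bib_match.
bib <- data.frame(
  bib_id = 0:4,
  title  = c("Faux: Simulation for Factorial Designs",
             "The Origins of Sex Differences in Human Behavior: Evolved Dispositions Versus Social Roles",
             "Evil Genius? How Dishonesty Can Lead to Greater Creativity",
             "Equivalence Testing for Psychological Research",
             "Human Error Is a Symptom of a Poor Design")
)
bib_match <- data.frame(
  bib_id = 1:3,
  doi    = c("10.1037/0003-066x.54.6.408",
             "10.1177/0956797614520714",
             "10.1177/2515245918770963"),
  title  = c("The origins of sex differences in human behavior: Evolved dispositions versus social roles.",
             "Retracted: Evil Genius? How Dishonesty Can Lead to Greater Creativity",
             "Equivalence Testing for Psychological Research: A Tutorial")
)

# References for which no match was stored.
unmatched <- bib[!bib$bib_id %in% bib_match$bib_id, ]
unmatched$bib_id

# Matched titles that start with "Retracted".
retracted <- bib_match[grepl("^retracted", bib_match$title, ignore.case = TRUE), ]
retracted$bib_id
```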

3.5 Working with a list of papers

A paperlist is simply a list of paper objects — for example the built-in psychsci dataset (250 open-access Psychological Science articles). To pull the same component from every paper into one combined table, use paper_table():

paper_table(psychsci[1:5], "info", c("title", "doi")) |>
  knitr::kable()

title	doi	paper_id
Mirror neurons, originally discovered in macaque monkeys using single-cell recordings, are active when an animal is either performing a particular action or observing another agent performing the same or a similar action (di	10.1177/0956797613520608	0956797613520608
Beyond Gist: Strategic and Incremental Information Accumulation for Scene Categorization	10.1177/0956797614522816	0956797614522816
Serotonin and Social Norms: Tryptophan Depletion Impairs Social Comparison and Leads to Resource Depletion in a Multiplayer Harvesting Game	10.1177/0956797614527830	0956797614527830
Action-Specific Disruption of Perceptual Confidence	10.1177/0956797614557697	0956797614557697
Emotional Vocalizations Are Recognized Across Cultures Regardless of the Valence of Distractors	10.1177/0956797614560771	0956797614560771
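
Conceptually, combining a component across papers amounts to tagging each paper's table with its paper_id and stacking the results. The sketch below shows that idea in base R on two mock papers with placeholder titles and DOIs. The function combine_component() is hypothetical and is not how paper_table() is implemented.

```r
# Two mock paper objects with only the parts needed here.
papers <- list(
  list(paper_id = "p1",
       info = data.frame(title = "Paper A", doi = "10.0000/placeholder.1")),
  list(paper_id = "p2",
       info = data.frame(title = "Paper B", doi = "10.0000/placeholder.2"))
)

# Hypothetical helper: take one component from every paper, add the
# paper_id as a column, and stack the tables row-wise.
combine_component <- function(papers, component) {
  rows <- lapply(papers, function(p) {
    tbl <- p[[component]]
    tbl$paper_id <- p$paper_id
    tbl
  })
  do.call(rbind, rows)
}

all_info <- combine_component(papers, "info")
all_info
```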

Most functions and every module accept either a single paper or a paperlist, so the same code scales from one paper to a whole corpus. See Using the Paper Database Corpora for the available corpora.