3 Reading in a Paper
The first step to using Metacheck is to read in a manuscript. Currently, Metacheck relies on GROBID, which works with PDF files. If you have a manuscript in a different format, save a copy as a PDF. You have two options to convert a PDF to the GROBID .xml format. First, you can use a local version of GROBID. We explain this below, but it requires more technical knowledge and sufficient space on your hard drive. Alternatively, you can use an online server. This is much simpler, and you can use a GROBID server at Eindhoven University of Technology that is GDPR compliant and does not store any data, so this is the approach we use by default in this manual. Nevertheless, you might prefer not to send your manuscript to an external server, or you might not be allowed to. We have developed Metacheck so that it can always run locally, without an internet connection, which keeps it available in these more restrictive situations.
The function convert() can read PDF files and save them in JSON format. When it uses an online server, this requires an internet connection and takes a few seconds per paper. The conversion needs to be done only once, and the converted file can be saved for later use. If you don’t supply an api_url, Metacheck will use a list of active servers and choose the first one available. By default, this is the GDPR-compliant GROBID server at Eindhoven University of Technology.
You can also supplement the reference section created by GROBID with information retrieved from CrossRef by setting crossref_lookup = TRUE. This can increase the accuracy of the information in the reference list. Retrieving this information also requires an internet connection and can take some time. You can always add this information later with the add_bibmatch() function.
pdf_file <- demofile("pdf")
json_file <- convert(file_path = pdf_file,
                     save_path = "converted",
                     crossref_lookup = TRUE)
3.1 Local conversion through GROBID in Docker
You can set up your own local GROBID server by following the instructions at https://grobid.readthedocs.io/. The easiest way is to use Docker. Run in your system terminal, the following command downloads the GROBID 0.9.0 image if it is not already present and starts the server on port 8070.
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.9.0
If you set api_url to the local address http://localhost:8070 (as illustrated below), convert() will use the local GROBID server. Metacheck also uses a local GROBID server by default if it detects one automatically.
json_file <- convert(file_path = pdf_file,
                     save_path = "converted",
                     method = "grobid",
                     api_url = "http://localhost:8070")
3.2 Load from JSON or XML
The function read() reads in papers created by convert(). It accepts both the bibr JSON files and the GROBID XML (TEI) files that conversion produces, so you can use whichever you have.
# read a converted JSON file
paper <- read(json_file)
# read a GROBID XML file directly
paper_xml <- read(demofile("xml"))
3.3 Batch Processing
The functions convert() and read() also work on a folder of files, or a vector of paths, returning a list of JSON file paths or paper objects, respectively. Batch processing is especially useful for scientists who want to examine a large number of manuscripts. For example, a metascientist might be interested in examining the uptake of open science practices over time, and run Metacheck across all articles published in a scientific journal. As part of Metacheck, we have created a paper repository of more than 15,000 open access articles that you can read in through the function papers_load(). See Using the Paper Database Corpora for how to load the available corpora.
When a folder contains both a .json and a .xml file for the same paper, read() uses the JSON and skips the duplicate XML.
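The JSON-over-XML rule can be illustrated with a small base R sketch. This is not Metacheck's implementation; the file names are hypothetical and the logic only mirrors the behaviour described above.

```r
# Hypothetical listing of a folder of converted papers
files <- c("paper_a.json", "paper_a.xml", "paper_b.xml", "paper_c.json")

# One key per paper: the file name without its extension
stem <- tools::file_path_sans_ext(files)
is_json <- tolower(tools::file_ext(files)) == "json"

# Keep every JSON file, and keep an XML file only when
# no JSON file with the same stem exists
keep <- is_json | !(stem %in% stem[is_json])
files_to_read <- files[keep]
files_to_read
```

Here paper_a.xml is dropped because paper_a.json exists, while paper_b.xml is kept because it has no JSON counterpart.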
3.4 The paper object
read() returns a paper object: a structured representation of the manuscript, stored as a list of tables. Everything Metacheck does — searching text, running modules, checking references — operates on this object, so it is worth understanding what is inside it and how to get at each piece.
A paper object has the class scivrs_paper, and you access its parts with the $ operator. Here are all the components:
class(paper)
#> [1] "scivrs_paper" "list"
names(paper)
#> [1] "paper_id" "info" "author" "text" "section" "url"
#> [7] "bib" "xref" "figure" "table" "eq" "bib_match"
The components are summarised below; the rest of this section shows each one. Most are data frames, so you can work with them using base R or the tidyverse.
| Component | What it holds |
|---|---|
| paper_id | A single string identifying the paper |
| info | One row of paper-level metadata (title, DOI, type, …) |
| author | One row per author |
| text | One row per sentence |
| section | One row per heading/section |
| url | Links found in the text |
| bib | The reference list |
| xref | Cross-references linking sentences to bib entries, tables, and figures |
| figure | Figures |
| table | Tables |
| eq | Extracted statistics and equations |
| bib_match | Reference metadata matched from CrossRef and other services |
paper_id
A single string identifying the paper (the user-supplied id, the DOI, or the filename):
paper$paper_id
#> [1] "to_err_is_human"
info
A one-row data frame of paper-level metadata. Use t() to read it as a column:
| | value |
|---|---|
| title | To Err is Human: An Empirical Investigation |
| keywords | NA |
| doi | 10.32614/10.5281/zenodo.2669586 |
| file_hash | 62ede2964b174f6d |
| input_format | grobid 0.9.0 |
| file_name | /Users/debruine/rproj/scienceverse/metacheck/inst/demos/to_err_is_human.xml |
| bibr_version | 10.0 |
| paper_type | unknown |
| paper_type_confidence | 0 |
| oecd_l1 | NA |
| oecd_l2 | NA |
| oecd_confidence | NA |
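As a minimal sketch of the t() step, the following uses a stand-in one-row data frame in place of paper$info. The three values are copied from the table above; the object names info and info_col are illustrative only.

```r
# Stand-in for paper$info: a one-row data frame with a few of its fields
info <- data.frame(
  title = "To Err is Human: An Empirical Investigation",
  paper_type = "unknown",
  paper_type_confidence = 0,
  stringsAsFactors = FALSE
)

# t() turns the single row into a character matrix with one row per field
info_col <- t(info)
colnames(info_col) <- "value"
info_col
```

Because t() converts the data frame to a matrix, all values become character strings, which is fine for reading but not for further computation.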
Pull out a single field directly:
paper$info$title
#> [1] "To Err is Human: An Empirical Investigation"
paper$info$doi
#> [1] "10.32614/10.5281/zenodo.2669586"
author
One row per author, in paper order:
| given | family | affiliation | corresponding |
|---|---|---|---|
| Daniel | Lakens | | FALSE |
| Lisa | Debruine | | FALSE |
| Jakub | Werner | | FALSE |
text
One row per sentence (excluding the title, headings, and references). This is the table that text_search() searches.
nrow(paper$text)
#> [1] 37
| text | section_id | paragraph_id |
|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. | 0 | 1 |
| All data are simulated. | 0 | 1 |
| The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations. | 0 | 1 |
| Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional. | 1 | 2 |
| From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). | 1 | 2 |
| Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections. | 1 | 2 |
section
One row per heading. section_type is Metacheck’s best guess at the kind of section, which modules use to focus on (say) only the results:
| section_id | header | section_type |
|---|---|---|
| 0 | Abstract | abstract |
| 1 | [div-01] | intro |
| 2 | Method | method |
| 3 | Procedure | method |
| 4 | Discussion | discussion |
| 5 | Data Availability | availability |
| 6 | Author Notes | contribution |
| 7 | Power Analysis | annex |
| 8 | Results | annex |
| 9 | Figure 2 : | figure |
| 10 | Table 1 : | table |
| 11 | Figure 1 : | figure |
| 12 | foot | |
| 13 | References | references |
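Because the text and section tables share section_id, sentences can be restricted to one kind of section with an ordinary join. The sketch below uses small stand-in data frames built from rows shown in this chapter rather than a real paper object, and it is not how Metacheck's modules are implemented.

```r
# Stand-ins for paper$text and paper$section (rows copied from the tables above)
text <- data.frame(
  text = c("All data are simulated.",
           "Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections."),
  section_id = c(0, 1),
  stringsAsFactors = FALSE
)
section <- data.frame(
  section_id = c(0, 1, 2),
  header = c("Abstract", "[div-01]", "Method"),
  section_type = c("abstract", "intro", "method"),
  stringsAsFactors = FALSE
)

# Attach the section type to every sentence, then keep only the introduction
text_with_type <- merge(text, section, by = "section_id")
intro_sentences <- text_with_type$text[text_with_type$section_type == "intro"]
intro_sentences
```

With a real paper you would replace the stand-ins with paper$text and paper$section.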
url
Every external link found in the text, with the sentence (text_id) it appeared in:
paper$url |>
  knitr::kable()
| href | link_text | text_id |
|---|---|---|
| https://osf.io/48ncu | NA | 9 |
| https://aspredicted.org/by8i8v.pdf | NA | 9 |
| https://github.com/Lak- | NA | 20 |
| https://researchbox.org/4377 | NA | 20 |
| https://osf.io/5tbm9 | NA | 21 |
| https://osf.io/629bx | NA | 21 |
bib
The reference list, parsed into columns:
| bib_id | authors | year | title |
|---|---|---|---|
| 0 | Debruine, Lisa | 2025 | Faux: Simulation for Factorial Designs |
| 1 | Eagly, Alice H; Wood, Wendy | 1999 | The Origins of Sex Differences in Human Behavior: Evolved Dispositions Versus Social Roles |
| 2 | Gino, Francesca; Wiltermuth, Scott S | 2014 | Evil Genius? How Dishonesty Can Lead to Greater Creativity |
| 3 | Lakens, Daniël | 2018 | Equivalence Testing for Psychological Research |
| 4 | Smith, F | 2021 | Human Error Is a Symptom of a Poor Design |
xref
Cross-references that link a sentence (text_id) to an item it cites — a bib entry, table, or figure (xref_type):
paper$xref |>
  knitr::kable()
| xref_id | xref_type | contents | text_id |
|---|---|---|---|
| 2 | bibr | (Gino and Wiltermuth 2014) | 4 |
| 0 | foot | foot_0 | 8 |
| 0 | figure | 2 | 17 |
| NA | table | 1 | 25 |
| 1 | figure | 1 | 25 |
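To see which reference an in-text citation points to, the xref rows of type bibr can be joined to the bib table. This sketch assumes that xref_id for bibr rows corresponds to bib_id, which is consistent with the demo tables shown here but is not stated explicitly; the data frames are stand-ins copied from those tables.

```r
# Stand-ins for paper$xref and paper$bib (values copied from the tables above)
xref <- data.frame(
  xref_id = c(2, 0, 0, NA, 1),
  xref_type = c("bibr", "foot", "figure", "table", "figure"),
  contents = c("(Gino and Wiltermuth 2014)", "foot_0", "2", "1", "1"),
  text_id = c(4, 8, 17, 25, 25),
  stringsAsFactors = FALSE
)
bib <- data.frame(
  bib_id = 0:4,
  year = c(2025, 1999, 2014, 2018, 2021),
  title = c("Faux: Simulation for Factorial Designs",
            "The Origins of Sex Differences in Human Behavior: Evolved Dispositions Versus Social Roles",
            "Evil Genius? How Dishonesty Can Lead to Greater Creativity",
            "Equivalence Testing for Psychological Research",
            "Human Error Is a Symptom of a Poor Design"),
  stringsAsFactors = FALSE
)

# Keep only citations of references, then look each one up in the reference list
# (assumption: xref_id matches bib_id for rows with xref_type == "bibr")
citations <- xref[xref$xref_type == "bibr", ]
cited <- merge(citations, bib, by.x = "xref_id", by.y = "bib_id")
cited[, c("text_id", "contents", "year", "title")]
```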
figure and table
Figures and tables each get a row, linked to the section table by section_id:
| figure_id | section_id | page_number |
|---|---|---|
| 1 | 9 | NA |
| 2 | 11 | NA |

| table_id | section_id | page_number |
|---|---|---|
| 1 | 10 | NA |
eq
Statistics and equations extracted from the text, broken into a left-hand side, degrees of freedom, comparator, and right-hand side. Many statistics modules read from this table rather than re-scanning the text:
| text_id | grp_id | lhs | df | comp | rhs |
|---|---|---|---|---|---|
| 15 | 1 | M | NA | = | 9.12 |
| 15 | 1 | M | NA | = | 10.9 |
| 15 | 1 | t | (97.7) | = | 2.9 |
| 15 | 1 | p | NA | = | 0.005 |
| 15 | 1 | d | NA | = | 0.59 |
| 16 | 2 | M | NA | = | 5.06 |
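Since the statistics are already split into columns, selecting a particular kind of value is a simple filter. The sketch below rebuilds the rows above as a stand-in for paper$eq and pulls out the reported p-values; the 0.05 threshold is purely illustrative, and as.numeric() is used in case rhs is stored as text in a real paper object.

```r
# Stand-in for paper$eq (rows copied from the table above)
eq <- data.frame(
  text_id = c(15, 15, 15, 15, 15, 16),
  grp_id = c(1, 1, 1, 1, 1, 2),
  lhs = c("M", "M", "t", "p", "d", "M"),
  df = c(NA, NA, "(97.7)", NA, NA, NA),
  comp = rep("=", 6),
  rhs = c(9.12, 10.9, 2.9, 0.005, 0.59, 5.06),
  stringsAsFactors = FALSE
)

# Select the p-values and flag those below an illustrative alpha of 0.05
p_values <- eq[eq$lhs == "p", ]
p_values$below_alpha <- as.numeric(p_values$rhs) < 0.05
p_values
```

The grp_id column makes it possible to go the other way as well, for example collecting every statistic reported alongside a given p-value with eq[eq$grp_id == 1, ].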
bib_match
When you convert with crossref_lookup = TRUE (or run a reference module), Metacheck matches each reference against services like CrossRef and stores the retrieved metadata here, keyed by bib_id:
| bib_id | service | doi | title |
|---|---|---|---|
| 1 | crossref | 10.1037/0003-066x.54.6.408 | The origins of sex differences in human behavior: Evolved dispositions versus social roles. |
| 2 | crossref | 10.1177/0956797614520714 | Retracted: Evil Genius? How Dishonesty Can Lead to Greater Creativity |
| 3 | crossref | 10.1177/2515245918770963 | Equivalence Testing for Psychological Research: A Tutorial |
3.5 Working with a list of papers
A paperlist is simply a list of paper objects — for example the built-in psychsci dataset (250 open-access Psychological Science articles). To pull the same component from every paper into one combined table, use paper_table():
| title | doi | paper_id |
|---|---|---|
| Mirror neurons, originally discovered in macaque monkeys using single-cell recordings, are active when an animal is either performing a particular action or observing another agent performing the same or a similar action (di | 10.1177/0956797613520608 | 0956797613520608 |
| Beyond Gist: Strategic and Incremental Information Accumulation for Scene Categorization | 10.1177/0956797614522816 | 0956797614522816 |
| Serotonin and Social Norms: Tryptophan Depletion Impairs Social Comparison and Leads to Resource Depletion in a Multiplayer Harvesting Game | 10.1177/0956797614527830 | 0956797614527830 |
| Action-Specific Disruption of Perceptual Confidence | 10.1177/0956797614557697 | 0956797614557697 |
| Emotional Vocalizations Are Recognized Across Cultures Regardless of the Valence of Distractors | 10.1177/0956797614560771 | 0956797614560771 |
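Conceptually, combining a component across papers means stacking each paper's table and labelling the rows with the paper they came from. The base R sketch below shows that idea with a hypothetical helper, combine_component(), and two stand-in papers built from rows in the table above; it is not the implementation or signature of paper_table().

```r
# Two stand-in paper objects, each reduced to a paper_id and an info table
papers <- list(
  list(paper_id = "0956797614522816",
       info = data.frame(
         title = "Beyond Gist: Strategic and Incremental Information Accumulation for Scene Categorization",
         doi = "10.1177/0956797614522816",
         stringsAsFactors = FALSE)),
  list(paper_id = "0956797614557697",
       info = data.frame(
         title = "Action-Specific Disruption of Perceptual Confidence",
         doi = "10.1177/0956797614557697",
         stringsAsFactors = FALSE))
)

# Hypothetical helper: take one component from every paper,
# add the paper_id as a column, and stack the results
combine_component <- function(paperlist, component) {
  rows <- lapply(paperlist, function(p) {
    tbl <- p[[component]]
    tbl$paper_id <- p$paper_id
    tbl
  })
  do.call(rbind, rows)
}

all_info <- combine_component(papers, "info")
all_info
```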
Most functions and every module accept either a single paper or a paperlist, so the same code scales from one paper to a whole corpus. See Using the Paper Database Corpora for the available corpora.
