3  Reading in a Paper

The first step in using Metacheck is to read in a manuscript. Currently, Metacheck relies on GROBID, which works with PDF files. If you have a manuscript in a different format, save a copy as a PDF. You have two options for converting a PDF to the GROBID .xml format. First, you can use a local version of GROBID. We explain this below, but it requires more technical knowledge and sufficient space on your hard drive. Alternatively, you can use an online server. This is much simpler, and you can use a GROBID server at Eindhoven University of Technology that is GDPR compliant and does not store any data. This is the approach we use by default in this manual. Nevertheless, you might prefer not to send your manuscript to an external server, or you might not be allowed to. We have designed Metacheck so that it can always work locally, without an internet connection, which keeps it available in these more restrictive situations.

The function convert() can read PDF files and save them in JSON format. When you use an online server, this requires an internet connection and takes a few seconds per paper. The conversion only needs to be done once, and the converted file can be saved for later use. If you don’t supply an api_url, Metacheck will use a list of active servers and choose the first one available. By default, this is a GROBID server at Eindhoven University of Technology that is GDPR compliant.

You can also supplement the reference section created by GROBID with information retrieved from CrossRef by setting crossref_lookup = TRUE. This can increase the accuracy of the information in the reference list. Retrieving this information also requires an internet connection, and can take some time. You can always add this information later with the add_bibmatch() function.

pdf_file <- demofile("pdf")
json_file <- convert(file_path = pdf_file,
                     save_path = "converted",
                     crossref_lookup = TRUE)

3.1 Local conversion through GROBID in Docker

You can set up your own local GROBID server following the instructions at https://grobid.readthedocs.io/. The easiest way is to use Docker. Run in your system terminal, the following command downloads the GROBID 0.9.0 image if it is not already present and starts the server on port 8070.

docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.9.0

If you set api_url to the local address http://localhost:8070 (as illustrated below), convert() will use the local GROBID server. Metacheck also uses a local GROBID server by default if one is automatically detected.

json_file <- convert(file_path = pdf_file,
                     save_path = "converted",
                     method = "grobid",
                     api_url = "http://localhost:8070")
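Before pointing convert() at a local server, it can help to confirm that the container is actually running. The sketch below is not part of Metacheck: grobid_is_alive() is a hypothetical helper that queries GROBID's /api/isalive health endpoint with base R and returns FALSE on any connection problem.

```r
# Hypothetical helper (not a Metacheck function): check whether a GROBID
# server answers on its health-check endpoint.
grobid_is_alive <- function(api_url = "http://localhost:8070") {
  health_url <- paste0(sub("/+$", "", api_url), "/api/isalive")
  response <- tryCatch(
    suppressWarnings(readLines(health_url, warn = FALSE)),
    error = function(e) character(0)  # unreachable server -> no response
  )
  # GROBID replies with the text "true" when the service is running
  length(response) > 0 && identical(trimws(response[1]), "true")
}

if (grobid_is_alive()) {
  message("Local GROBID server found at http://localhost:8070")
} else {
  message("No local GROBID server detected; start the Docker container first")
}
```

If the check fails right after starting the container, wait a few seconds and try again, since the server may need some time to load before it responds.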

3.2 Load from JSON or XML

The function read() reads in papers created by convert(). It accepts both the bibr JSON files and the GROBID XML (TEI) files that conversion produces, so you can use whichever you have.

# read a converted JSON file
paper <- read(json_file)
# read a GROBID XML file directly
paper_xml <- read(demofile("xml"))

3.3 Batch Processing

The functions convert() and read() also work on a folder of files, or a vector of paths, returning a list of JSON file paths or paper objects, respectively. Batch processing is especially useful for scientists who want to examine a large number of manuscripts. For example, a metascientist might be interested in examining the uptake of open science practices over time, and run Metacheck across all articles published in a scientific journal. As part of Metacheck, we have created a paper repository of more than 15,000 open access articles that you can read in through the function papers_load(). See Using the Paper Database Corpora for how to load the available corpora.

When a folder contains both a .json and a .xml file for the same paper, read() uses the JSON and skips the duplicate XML.
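The rule can be pictured with plain file names. The sketch below is only an illustration in base R with made-up file names; it is not read()'s actual implementation.

```r
# Made-up file names standing in for the contents of a folder
files <- c("smith2020.json", "smith2020.xml", "jones2021.xml", "lee2019.json")

stem    <- tools::file_path_sans_ext(basename(files))
is_json <- tolower(tools::file_ext(files)) == "json"

# Keep every JSON file, and an XML file only if no JSON shares its name
keep <- is_json | !(stem %in% stem[is_json])
files_to_read <- files[keep]
print(files_to_read)
```

Here smith2020.xml is dropped because smith2020.json exists, while jones2021.xml is kept because it has no JSON counterpart.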

3.4 The paper object

read() returns a paper object: a structured representation of the manuscript, stored as a list of tables. Everything Metacheck does — searching text, running modules, checking references — operates on this object, so it is worth understanding what is inside it and how to get at each piece.

A paper object has the class scivrs_paper, and you access its parts with the $ operator. Here are all the components:

class(paper)
#> [1] "scivrs_paper" "list"
names(paper)
#>  [1] "paper_id"  "info"      "author"    "text"      "section"   "url"      
#>  [7] "bib"       "xref"      "figure"    "table"     "eq"        "bib_match"

The components are summarised below; the rest of this section shows each one. Most are data frames, so you can work with them using base R or the tidyverse.

| Component | What it holds |
|---|---|
| paper_id | A single string identifying the paper |
| info | One row of paper-level metadata (title, DOI, type, …) |
| author | One row per author |
| text | One row per sentence |
| section | One row per heading/section |
| url | Links found in the text |
| bib | The reference list |
| xref | Cross-references linking sentences to bib entries, tables, and figures |
| figure | Figures |
| table | Tables |
| eq | Extracted statistics and equations |
| bib_match | Reference metadata matched from CrossRef and other services |

paper_id

A single string identifying the paper (the user-supplied id, the DOI, or the filename):

paper$paper_id
#> [1] "to_err_is_human"

info

A one-row data frame of paper-level metadata. Use t() to read it as a column:

t(paper$info) |>
  knitr::kable(col.names = "value")

| | value |
|---|---|
| title | To Err is Human: An Empirical Investigation |
| keywords | NA |
| doi | 10.32614/10.5281/zenodo.2669586 |
| file_hash | 62ede2964b174f6d |
| input_format | grobid 0.9.0 |
| file_name | /Users/debruine/rproj/scienceverse/metacheck/inst/demos/to_err_is_human.xml |
| bibr_version | 10.0 |
| paper_type | unknown |
| paper_type_confidence | 0 |
| oecd_l1 | NA |
| oecd_l2 | NA |
| oecd_confidence | NA |

Pull out a single field directly:

paper$info$title
#> [1] "To Err is Human: An Empirical Investigation"
paper$info$doi
#> [1] "10.32614/10.5281/zenodo.2669586"

author

One row per author, in paper order:

paper$author[, c("given", "family", "affiliation", "corresponding")] |>
  knitr::kable()

| given | family | affiliation | corresponding |
|---|---|---|---|
| Daniel | Lakens | | FALSE |
| Lisa | Debruine | | FALSE |
| Jakub | Werner | | FALSE |

text

One row per sentence (excluding the title, headings, and references). This is the table that text_search() searches.

nrow(paper$text)
#> [1] 37
head(paper$text[, c("text", "section_id", "paragraph_id")]) |>
  knitr::kable()

| text | section_id | paragraph_id |
|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. | 0 | 1 |
| All data are simulated. | 0 | 1 |
| The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations. | 0 | 1 |
| Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional. | 1 | 2 |
| From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). | 1 | 2 |
| Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections. | 1 | 2 |

section

One row per heading. section_type is Metacheck’s best guess at the kind of section, which modules use to focus on (say) only the results:

paper$section[, c("section_id", "header", "section_type")] |>
  knitr::kable()

| section_id | header | section_type |
|---|---|---|
| 0 | Abstract | abstract |
| 1 | [div-01] | intro |
| 2 | Method | method |
| 3 | Procedure | method |
| 4 | Discussion | discussion |
| 5 | Data Availability | availability |
| 6 | Author Notes | contribution |
| 7 | Power Analysis | annex |
| 8 | Results | annex |
| 9 | Figure 2 : | figure |
| 10 | Table 1 : | table |
| 11 | Figure 1 : | figure |
| 12 | | foot |
| 13 | References | references |
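Because text and section share the section_id column, you can restrict the sentences to one kind of section with an ordinary join. The sketch below uses two small stand-in data frames that copy a few values from the demo output so that it runs on its own; with a real paper you would use paper$text and paper$section instead.

```r
# Stand-in data copied from the demo output (not the full tables)
text <- data.frame(
  text = c(
    "All data are simulated.",
    "From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020).",
    "Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections."
  ),
  section_id = c(0, 1, 1)
)
section <- data.frame(
  section_id   = c(0, 1, 2),
  header       = c("Abstract", "[div-01]", "Method"),
  section_type = c("abstract", "intro", "method")
)

# Attach the section type to every sentence, then keep the introduction
text_typed <- merge(text, section, by = "section_id")
intro_text <- text_typed[text_typed$section_type == "intro", "text"]
print(intro_text)
```

The same pattern works for any section_type value, for example "method" or "discussion".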

url

Every external link found in the text, with the sentence (text_id) it appeared in:

paper$url |>
  knitr::kable()

| href | link_text | text_id |
|---|---|---|
| https://osf.io/48ncu | NA | 9 |
| https://aspredicted.org/by8i8v.pdf | NA | 9 |
| https://github.com/Lak- | NA | 20 |
| https://researchbox.org/4377 | NA | 20 |
| https://osf.io/5tbm9 | NA | 21 |
| https://osf.io/629bx | NA | 21 |

bib

The reference list, parsed into columns:

paper$bib[, c("bib_id", "authors", "year", "title")] |>
  knitr::kable()

| bib_id | authors | year | title |
|---|---|---|---|
| 0 | Debruine, Lisa | 2025 | Faux: Simulation for Factorial Designs |
| 1 | Eagly, Alice H; Wood, Wendy | 1999 | The Origins of Sex Differences in Human Behavior: Evolved Dispositions Versus Social Roles |
| 2 | Gino, Francesca; Wiltermuth, Scott S | 2014 | Evil Genius? How Dishonesty Can Lead to Greater Creativity |
| 3 | Lakens, Daniël | 2018 | Equivalence Testing for Psychological Research |
| 4 | Smith, F | 2021 | Human Error Is a Symptom of a Poor Design |

xref

Cross-references that link a sentence (text_id) to an item it cites — a bib entry, table, or figure (xref_type):

paper$xref |>
  knitr::kable()

| xref_id | xref_type | contents | text_id |
|---|---|---|---|
| 2 | bibr | (Gino and Wiltermuth 2014) | 4 |
| 0 | foot | foot_0 | 8 |
| 0 | figure | 2 | 17 |
| NA | table | 1 | 25 |
| 1 | figure | 1 | 25 |
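In the demo output, the bibr row has xref_id 2 and the Gino and Wiltermuth reference has bib_id 2, which suggests that xref_id holds the bib_id for citations. Under that assumption (inferred from this output, not confirmed by the package documentation), the following sketch looks up the reference behind each in-text citation. The data frames are small stand-ins copied from the demo output; with a real paper you would use paper$xref and paper$bib.

```r
# Stand-in data copied from the demo output (not the full tables)
xref <- data.frame(
  xref_id   = c(2, 0, 0),
  xref_type = c("bibr", "foot", "figure"),
  contents  = c("(Gino and Wiltermuth 2014)", "foot_0", "2"),
  text_id   = c(4, 8, 17)
)
bib <- data.frame(
  bib_id = c(0, 2),
  year   = c(2025, 2014),
  title  = c("Faux: Simulation for Factorial Designs",
             "Evil Genius? How Dishonesty Can Lead to Greater Creativity")
)

# Keep only citations, then match xref_id to bib_id (assumed to correspond)
citations <- xref[xref$xref_type == "bibr", ]
cited <- merge(citations, bib, by.x = "xref_id", by.y = "bib_id")
print(cited[, c("text_id", "contents", "year", "title")])
```

Filtering on xref_type first matters, because footnotes and figures reuse the same id values as references.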

figure and table

Figures and tables each get a row, linked to the section table by section_id:

paper$figure[, c("figure_id", "section_id", "page_number")] |>
  knitr::kable()

| figure_id | section_id | page_number |
|---|---|---|
| 1 | 9 | NA |
| 2 | 11 | NA |

paper$table[, c("table_id", "section_id", "page_number")] |>
  knitr::kable()

| table_id | section_id | page_number |
|---|---|---|
| 1 | 10 | NA |

eq

Statistics and equations extracted from the text, broken into a left-hand side, degrees of freedom, comparator, and right-hand side. Many statistics modules read from this table rather than re-scanning the text:

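head(paper$eq) |>
  knitr::kable()

| text_id | grp_id | lhs | df | comp | rhs |
|---|---|---|---|---|---|
| 15 | 1 | M | NA | = | 9.12 |
| 15 | 1 | M | NA | = | 10.9 |
| 15 | 1 | t | (97.7) | = | 2.9 |
| 15 | 1 | p | NA | = | 0.005 |
| 15 | 1 | d | NA | = | 0.59 |
| 16 | 2 | M | NA | = | 5.06 |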
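As a small illustration of reading from this table, the sketch below pulls out the reported p-values and compares them with a threshold. The data frame copies the six rows from the demo output. Whether rhs is stored as text or as a number in a real paper object is not stated here, so the sketch converts it with as.numeric() to be safe. The 0.05 threshold is only an example.

```r
# Stand-in data copied from the demo output; rhs is assumed to be text
eq <- data.frame(
  text_id = c(15, 15, 15, 15, 15, 16),
  grp_id  = c(1, 1, 1, 1, 1, 2),
  lhs     = c("M", "M", "t", "p", "d", "M"),
  df      = c(NA, NA, "(97.7)", NA, NA, NA),
  comp    = "=",
  rhs     = c("9.12", "10.9", "2.9", "0.005", "0.59", "5.06"),
  stringsAsFactors = FALSE
)

# Keep the rows that report a p-value and convert the value to a number
p_rows <- eq[eq$lhs == "p", ]
p_rows$p_value  <- as.numeric(p_rows$rhs)
p_rows$below_05 <- p_rows$p_value < 0.05
print(p_rows[, c("text_id", "grp_id", "p_value", "below_05")])
```

The grp_id column keeps the p-value attached to the other statistics reported with it, here the t-test in group 1.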

bib_match

When you convert with crossref_lookup = TRUE (or run a reference module), Metacheck matches each reference against services like CrossRef and stores the retrieved metadata here, keyed by bib_id:

paper$bib_match[, c("bib_id", "service", "doi", "title")] |>
  knitr::kable()

| bib_id | service | doi | title |
|---|---|---|---|
| 1 | crossref | 10.1037/0003-066x.54.6.408 | The origins of sex differences in human behavior: Evolved dispositions versus social roles. |
| 2 | crossref | 10.1177/0956797614520714 | Retracted: Evil Genius? How Dishonesty Can Lead to Greater Creativity |
| 3 | crossref | 10.1177/2515245918770963 | Equivalence Testing for Psychological Research: A Tutorial |
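The demo output shows matches for bib_id 1 to 3 only, so two of the five references have no CrossRef match. One way to list such unmatched references is to compare the bib_id values in the two tables. The sketch uses stand-in data copied from the demo output; with a real paper you would use paper$bib and paper$bib_match.

```r
# Stand-in data copied from the demo output (selected columns only)
bib <- data.frame(
  bib_id  = 0:4,
  authors = c("Debruine, Lisa", "Eagly, Alice H; Wood, Wendy",
              "Gino, Francesca; Wiltermuth, Scott S", "Lakens, Daniël",
              "Smith, F"),
  year    = c(2025, 1999, 2014, 2018, 2021)
)
bib_match <- data.frame(
  bib_id  = 1:3,
  service = "crossref"
)

# References whose bib_id never appears in bib_match
unmatched <- bib[!(bib$bib_id %in% bib_match$bib_id), ]
print(unmatched)
```

An unmatched reference is not necessarily wrong, but it is a natural candidate for a manual check.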

3.5 Working with a list of papers

A paperlist is simply a list of paper objects — for example the built-in psychsci dataset (250 open-access Psychological Science articles). To pull the same component from every paper into one combined table, use paper_table():

paper_table(psychsci[1:5], "info", c("title", "doi")) |>
  knitr::kable()

| title | doi | paper_id |
|---|---|---|
| Mirror neurons, originally discovered in macaque monkeys using single-cell recordings, are active when an animal is either performing a particular action or observing another agent performing the same or a similar action (di | 10.1177/0956797613520608 | 0956797613520608 |
| Beyond Gist: Strategic and Incremental Information Accumulation for Scene Categorization | 10.1177/0956797614522816 | 0956797614522816 |
| Serotonin and Social Norms: Tryptophan Depletion Impairs Social Comparison and Leads to Resource Depletion in a Multiplayer Harvesting Game | 10.1177/0956797614527830 | 0956797614527830 |
| Action-Specific Disruption of Perceptual Confidence | 10.1177/0956797614557697 | 0956797614557697 |
| Emotional Vocalizations Are Recognized Across Cultures Regardless of the Valence of Distractors | 10.1177/0956797614560771 | 0956797614560771 |
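Conceptually, this is similar to extracting the component from each paper, tagging the rows with the paper's id, and stacking the results. The base R sketch below illustrates that idea on a toy list of two made-up papers. It is not paper_table()'s actual implementation, and the titles, DOIs, and ids are invented.

```r
# Toy "paperlist": each element mimics a paper object with paper_id and info
toy_papers <- list(
  list(paper_id = "paper_a",
       info = data.frame(title = "First toy title", doi = "10.0000/a")),
  list(paper_id = "paper_b",
       info = data.frame(title = "Second toy title", doi = "10.0000/b"))
)

# Pull the same columns from every paper and record where each row came from
rows <- lapply(toy_papers, function(p) {
  out <- p$info[, c("title", "doi")]
  out$paper_id <- p$paper_id
  out
})

# Stack the per-paper rows into one combined table
combined <- do.call(rbind, rows)
print(combined)
```

The paper_id column is what lets you trace each row of the combined table back to its source paper.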

Most functions and every module accept either a single paper or a paperlist, so the same code scales from one paper to a whole corpus. See Using the Paper Database Corpora for the available corpora.