Installation
You can install the development version of metacheck from GitHub with:
# install.packages("devtools")
devtools::install_github("scienceverse/metacheck")You can launch a shiny app, but this has limited features and is under development.
metacheck::metacheck_app()Load from PDF
The function convert() can read PDF files and save them
in JSON
format. This requires an internet connection and takes a few seconds
per paper, so should only be done once and the results saved for later
use.
You can set up your own local grobid server following instructions from https://grobid.readthedocs.io/. The easiest way is to use Docker.
Then you can set your api_url to the local path http://localhost:8070.
json_file <- convert(file_path = pdf_file,
save_path = "converted",
method = "grobid",
api_url = "http://localhost:8070")Load from non-PDF document
To take advantage of grobid’s ability to parse references and other aspects of papers, for now the best way is to convert your papers to PDF. We will introduce our custom backend, bibr, soon and this will be able to convert DOC and DOCX files directly.
Batch Processing
The functions convert() and read() also
work on a folder of files, returning a list of JSON file paths or paper
objects, respectively. The functions text_search(),
text_expand() and llm() also work on a list of
paper objects.
Paper Components
Paper objects contain a lot of structured information, including info, references, and citations.
Info
paper$info#> title keywords
#> 1 To Err is Human: An Empirical Investigation NA
#> doi file_hash input_format
#> 1 10.32614/10.5281/zenodo.2669586 62ede2964b174f6d grobid 0.9.0
#> file_name
#> 1 /Users/debruine/rproj/scienceverse/metacheck/inst/demos/to_err_is_human.xml
#> bibr_version paper_type paper_type_confidence oecd_l1 oecd_l2 oecd_confidence
#> 1 10.0 unknown 0 <NA> <NA> NA
Bibliography
The bibliography is provided in a tabular format.
paper$bib| bib_type | doi | title | authors | editors | publisher | year | volume | issue | container | first_page | last_page | bib_id | year_suffix | text_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| article | 10.32614/10.5281/zenodo.2669586 | Faux: Simulation for Factorial Designs | Debruine, Lisa | 2025 | Zenodo | NA | NA | 0 | 33 | |||||
| article | 10.1037/0003-066x.54.6.408 | The Origins of Sex Differences in Human Behavior: Evolved Dispositions Versus Social Roles | Eagly, Alice H; Wood, Wendy | 1999 | 54 | 6 | American Psychologist | 408 | 423 | 1 | 34 | |||
| article | 10.1177/0956797614520714 | Evil Genius? How Dishonesty Can Lead to Greater Creativity | Gino, Francesca; Wiltermuth, Scott S | 2014 | 25 | 4 | Psychological Science | 973 | 981 | 2 | 35 | |||
| article | Equivalence Testing for Psychological Research | Lakens, Daniël | 2018 | 1 | Psychological Science | 259 | 270 | 3 | 36 | |||||
| article | 10.0000/0123456789 | Human Error Is a Symptom of a Poor Design | Smith, F | 2021 | Journal of Journals | NA | NA | 4 | 37 |
Cross-References
Cross-references are also provided in a tabular format, with
xref_id to match the bibliography table.
paper$xref| xref_id | xref_type | contents | text_id |
|---|---|---|---|
| 2 | bibr | (Gino and Wiltermuth 2014) | 4 |
| 0 | foot | foot_0 | 8 |
| 0 | figure | 2 | 17 |
| NA | table | 1 | 25 |
| 1 | figure | 1 | 25 |
Batch
The psychsci built-in dataset contains all 250
open-access articles published in Psychological Science from
2004 to 2014. Psychological Science is the flagship journal of
the Association for Psychological Science and publishes empirical
research from all subfields of psychology. The PDFs were converted to
metacheck paper objects using convert() and stored in the
package. This dataset is used throughout the documentation to
demonstrate how to work with a list of papers.
There are functions to combine the information from a list of papers,
like psychsci.
paper_table(psychsci[1:5], "info", c("title", "doi"))#> # A tibble: 5 × 3
#> title doi paper_id
#> <chr> <chr> <chr>
#> 1 Mirror neurons, originally discovered in macaque monkeys using… 10.1… 0956797…
#> 2 Beyond Gist: Strategic and Incremental Information Accumulatio… 10.1… 0956797…
#> 3 Serotonin and Social Norms: Tryptophan Depletion Impairs Socia… 10.1… 0956797…
#> 4 Action-Specific Disruption of Perceptual Confidence 10.1… 0956797…
#> 5 Emotional Vocalizations Are Recognized Across Cultures Regardl… 10.1… 0956797…
paper_table(psychsci[1:5], "bib") |>
dplyr::filter(!is.na(doi), doi != "")#> # A tibble: 6 × 16
#> bib_type doi title authors editors publisher year volume issue first_page
#> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 article 10.338… The … Zylber… "" "" 2012 6 "" NA
#> 2 article 10.103… Stro… Ekman,… "" "" 1994 115 "" 268
#> 3 article 10.117… Cult… Gendro… "" "" 2014 25 "" 911
#> 4 article 10.103… Is t… Russel… "" "" 1994 115 "" 102
#> 5 article 10.108… Perc… Sauter… "" "" 2010 63 "" 2251
#> 6 article 10.107… Cros… Sauter… "" "" 2010 107 "" 2408
#> # ℹ 6 more variables: last_page <chr>, container <chr>, bib_id <int>,
#> # year_suffix <chr>, text_id <int>, paper_id <chr>
Search Text
You can access a parsed table of the full text of the paper via
paper$text, but you may find it more convenient to use the
function text_search(). The defaults return a data table of
each sentence, with the section type, header, div, paragraph and
sentence numbers, and file name. (The section type is a best guess from
the headers, so may not always be accurate.)
text <- text_search(paper)| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. | 1 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| All data are simulated. | 2 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations. | 3 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional. | 4 | 2 | 1 | NA | Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional. | to_err_is_human | [div-01] | intro |
| From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). | 5 | 2 | 1 | NA | NA | to_err_is_human | [div-01] | intro |
| Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections. | 6 | 2 | 1 | NA | NA | to_err_is_human | [div-01] | intro |
Pattern
You can search for a specific word or phrase by setting the
pattern argument. The pattern is a regex string by default;
set fixed = TRUE if you want to find exact text
matches.
text <- text_search(paper, pattern = "metacheck")| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. | 1 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| In this study we examine the usefulness of metacheck to improve best practices. | 7 | 2 | 1 | NA | NA | to_err_is_human | [div-01] | intro |
Return
Set return to one of “sentence”, “paragraph”, “section”,
or “match” to control what gets returned.
text <- text_search(paper, "GitHub",
return = "paragraph")| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. All data are simulated. The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations. | NA | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| Data and analysis code is available on GitHub from https://github.com/Lak-ens/to_err_is_human and from https://researchbox.org/4377. | NA | 8 | 5 | NA | NA | to_err_is_human | Data Availability | availability |
Regex matches
You can also return just the matched text from a regex search by
setting return = "match". The extra ...
arguments in text_search() are passed to
grep(), so perl = TRUE allows you to use more
complex regex, like below.
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- text_search(paper, pattern, return = "match", perl = TRUE)| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| M = 9.12 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| M = 10.9 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| t(97.7) = 2.9 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| p = 0.005 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| d =0.59 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| M = 5.06 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| M = 4.5 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| t(97.2) = -1.96 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| p =0.152 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| N = 50 | 24 | 11 | 7 | NA | NA | to_err_is_human | Power Analysis | annex |
| pwr::pwr.t.test(n = 50 | 27 | 14 | 8 | NA | NA | to_err_is_human | Results | annex |
| power = 0.8 | 27 | 14 | 8 | NA | NA | to_err_is_human | Results | annex |
Expand Text
You can expand the text returned by text_search() or a
module with text_expand().
marginal <- text_search(paper, "marginal") |>
text_expand(paper, plus = 1, minus = 1)
marginal[, c("text", "expanded")]#> # A tibble: 2 × 2
#> text expanded
#> <chr> <chr>
#> 1 "The paper shows examples of (1) open and closed OSF links; (2a) cit… "All da…
#> 2 "On average researchers in the experimental condition found the app … "On ave…
Large Language Models
LLM use is entirely optional. Metacheck follows the principle that AI should be opt-in, restricted to classification of existing text, and used only where it provides clear benefits that cannot be achieved with other methods such as regular expressions. The vast majority of modules work without any LLM at all.
Currently, the only module that will unlock substantial extra
functionality when an lmm is used is the power
module, which can read sentences about power analyses and
extract structured information (test type, sample size, effect size,
etc.) that would be difficult to capture reliably with regular
expressions alone.
Option 1: Run a model locally with Ollama (recommended)
The recommended approach is to run an AI model on your own computer using Ollama, a free open-source tool. This means:
- No API key or account required
- No data leaves your computer
- No usage costs or rate limits
The trade-off is speed — your computer does the processing, so it is slower than a cloud API, especially the first time a model loads. See the dedicated Local AI with Ollama article for full setup instructions. In brief:
- Download and install Ollama from https://ollama.com/download
- Pull a model, e.g.
ollama pull llama3.2in a terminal - In R, set metacheck to use it:
Option 2: Use a cloud API
You can also use any model supported by ellmer via a cloud API. This is faster but requires an account and API key with a provider, and sends text to an external service.
Get an API key from your preferred provider (e.g. https://console.groq.com/keys) and add it to your
.Renviron file (use usethis::edit_r_environ()
to open it):
When metacheck starts it checks for API keys in
.Renviron and sets the model automatically. You can also
set it manually:
llm_model() # check which model is currently set
llm_model("groq") # use ellmer's default Groq model
llm_model("groq/llama-3.3-70b-versatile") # use a specific modelA list of available models for a provider:
| platform | id | object | owned_by | context_window | max_completion_tokens | created_at |
|---|---|---|---|---|---|---|
| groq | meta-llama/llama-prompt-guard-2-22m | model | Meta | 512 | 512 | 2025-05-30 |
| groq | llama-3.1-8b-instant | model | Meta | 131072 | 131072 | 2023-09-03 |
| groq | allam-2-7b | model | SDAIA | 4096 | 4096 | 2025-01-23 |
| groq | openai/gpt-oss-120b | model | OpenAI | 131072 | 65536 | 2025-08-05 |
| groq | meta-llama/llama-prompt-guard-2-86m | model | Meta | 512 | 512 | 2025-05-30 |
LLM Queries
You can query the extracted text of papers with LLMs. See
?llm for details of how to get and set up your API key,
choose an LLM, and adjust settings.
Use text_search() first to narrow down the text into
what you want to query. Below, we limited search to the first ten
papers, and returned sentences that contains the word “power” and at
least one number. Then we asked an LLM to determine if this is an a
priori power analysis, and if so, to return some relevant values in a
JSON-structured format.
power <- psychsci[1:10] |>
# sentences containing the word power
text_search("power") |>
# and containing at least one number
text_search("[0-9]")
# ask a specific question with specific response format
system_prompt <- 'Does this sentence report an a priori power analysis? If so, return the test, sample size, critical alpha criterion, power level, effect size and effect size metric plus any other relevant parameters, in JSON format like:
{
"apriori": true,
"test": "paired samples t-test",
"sample": 20,
"alpha": 0.05,
"power": 0.8,
"es": 0.4,
"es_metric": "cohen\'s D"
}
If not, return {"apriori": false}
Answer only in valid JSON format, starting with { and ending with }.'
llm_power <- llm(power, system_prompt)Expand JSON
It is useful to ask an LLM to return data in JSON structured format,
but can be frustrating to extract the data, especially where the LLM
makes syntax mistakes. The function json_expand() tries to
expand a column with a JSON-formatted response into columns and deals
with it gracefully (sets an ‘error’ column to “parsing error”) if there
are errors. It also fixes column data types, if possible.
llm_response <- json_expand(llm_power, "answer") |>
dplyr::select(text, apriori:es_metric)| text | apriori | test | sample | alpha | power | es | es_metric |
|---|---|---|---|---|---|---|---|
| It is possible that less-consistent effects were observed on trials with errors because of reduced power to detect an effect on these trials, which by design were less numerous (~25%). | FALSE | NA | NA | NA | NA | NA | NA |
| Figure 1 shows that CY had very little predictive power for CLIM, but the fit in the transposed plot has an obvious bell-shaped curve. | FALSE | NA | NA | NA | NA | NA | NA |
| Sample size was calculated with an a priori power analysis, using the effect sizes reported by Küpper et al. (2014), who used identical procedures, materials, and dependent measures. | TRUE | NA | NA | NA | NA | NA | NA |
| We determined that a minimum sample size of 7 per group would be necessary for 95% power to detect an effect. | TRUE | t-test | 7 | 0.050 | 0.95 | NA | NA |
| For the first part of the task, 11 static visual images, one from each of the scenes in the film were presented once each on a black background for 2 s using Power-Point. | FALSE | NA | NA | NA | NA | NA | NA |
| A sample size of 26 per group was required to ensure 80% power to detect this difference at the 5% significance level. | TRUE | two-sample t-test | 26 | 0.050 | 0.80 | NA | NA |
| A sample size of 18 per condition was required in order to ensure an 80% power to detect this difference at the 5% significance level. | TRUE | t-test | 18 | 0.050 | 0.80 | NA | NA |
| The 13,500 selected loan requests conservatively achieved a power of .98 for an effect size of .07 at an alpha level of .05. | TRUE | 13500 | 0.050 | 0.98 | 0.07 | NA | |
| On the basis of simulations over a range of expected effect sizes for contrasts of fMRI activity, we estimated that a sample size of 24 would provide .80 power at a conservative brainwide alpha threshold of .002 (although such thresholds ideally should be relaxed for detecting activity in regions where an effect is predicted). | TRUE | fMRI activity contrast | 24 | 0.002 | 0.80 | NA | NA |
| Stimulus sample size was determined via power analysis of the sole existing similar study, which used neural activity to predict Internet downloads of music (Berns & Moore, 2012). | TRUE | NA | NA | NA | NA | NA | NA |
| The effect size from that study implied that a sample size of 72 loan requests would be required to achieve .80 power at an alpha level of .05. | TRUE | 72 | 0.050 | 0.80 | NA | NA | |
| Categorical ratings of the emotional expressions in the loan photographs had a similarly powerful impact on loan-request success; requests with “happy” photographs received $5.15 more per hour than requests with “sad” photographs, on average; they achieved full funding in 7.6% less time. | FALSE | NA | NA | NA | NA | NA | NA |
| Although previous research has provided mixed evidence about the impact of positive versus negative affect on charitable giving (Andreoni, 1990;Small & Verrochi, 2009), by simultaneously assessing affect at both Internet-aggregate and laboratory-sample levels of analysis, our studies provide consistent evidence that photograph-elicited positive arousal most powerfully promoted lending rates and outcomes (Tables 1 and 2, Fig. 2a, and Fig. | FALSE | NA | NA | NA | NA | NA | NA |
Rate Limiting
The llm() function makes a separate query for each row
in a data frame from text_search(). (Using parallel
functions in ellmer can be more efficient but currently does not
associate structured output correctly when inputs may have 0 or more
outputs.) To prevent accidentally making too many calls because of
errors in your code, a default limit of 30 queries is set, which you can
change:
llm_max_calls(30)Repository Functions
Metacheck can find links to research repositories in a paper, retrieve the list of files they contain, and download those files for further checking. Four online services are supported, as well as local folders on your own computer.
| Repository | Link function | Info / file list | Download |
|---|---|---|---|
| OSF | osf_links() |
osf_info() |
osf_file_download() |
| GitHub | github_links() |
github_files(), github_info()
|
— |
| ResearchBox | rbox_links() |
rbox_info() |
rbox_file_download() |
| Zenodo | zenodo_links() |
zenodo_info() |
zenodo_file_download() |
| Local folder | — | local_files() |
— |
The repo_check and code_check modules use
all of these automatically: repo_check finds all repository
links in a paper and builds a unified file list, and
code_check then analyses any code files in that list. See
the GitHub and Local Files articles for more detail on
those two sources.
OSF Links and IDs
Get any OSF links from a paper or list of papers.
#> [1] "https://osf.io/e2aks/"
#> [2] "https://osf.io/tvyxz/wiki/view/"
#> [3] "https://osf.io/t9j8e/?view_only=f171281f212f4435917b16a9e581a73b"
#> [4] "https://osf.io/tvyxz/wiki/1.%20View%20the%20Badges/"
#> [5] "https://osf.io/eky4s/"
#> [6] "https://osf.io/xgwhk"
You can see that some of them have rogue spaces or view-only links.
The function osf_check_id() takes most formats of OSF links
(with or without https:// and
osf.io/, as well as the 25-character waterbutler IDs) and converts them
to short IDs.
osf_ids <- osf_check_id(links$href) |> unique()
head(osf_ids)#> [1] "e2aks"
#> [2] "tvyxz"
#> [3] "t9j8e?view_only=f171281f212f4435917b16a9e581a73b"
#> [4] "eky4s"
#> [5] "xgwhk"
#> [6] "5t0b7"
However, all of the osf_***() functions fix IDs for you
and handle duplicate IDs without making extra API calls, so you don’t
need to add this step to most workflows.
OSF Info
Get basic information about OSF links, such as the name, description, osf_type (nodes, files, preprints, registrations, users, set to “private” if you don’t have authorisation to view it, and “invalid” if the ), whether it is public
#> # A tibble: 6 × 5
#> href osf_id osf_type public category
#> <chr> <chr> <chr> <lgl> <chr>
#> 1 https://osf.io/e2aks/ e2aks nodes TRUE project
#> 2 https://osf.io/tvyxz/wiki/view/ tvyxz nodes TRUE project
#> 3 https://osf.io/tvyxz/wiki/view/ tvyxz nodes TRUE project
#> 4 https://osf.io/t9j8e/?view_only=f171281f212f4… t9j8e… nodes FALSE project
#> 5 https://osf.io/tvyxz/wiki/1.%20View%20the%20B… tvyxz nodes TRUE project
#> 6 https://osf.io/eky4s/ eky4s nodes TRUE project
For now, the OSF API does not let us retrieve any information about view-only links. They may be viewable by you in the web browser if the link is still active, but will be listed in the table as public = FALSE and osf_type = “private”.
You can set the argument recursive = TRUE to also
retrieve information about all nodes and files that are contained by the
OSF link.
#> [1] 3
Download OSF Files
OSF projects let you organise information into nested components, and files within those components. Therefore, to retrieve all of the files associate with a project, you may need to navigate to several components and download zip files for the files from each components, then reorganise and rename the downloaded folders.
The function osf_file_download() does all of this for
you, recreating a folder structure based on the component names and
downloading all files smaller than max_file_size (defaults
to 10 MB) up to a total size of max_download_size (defaults
to 100 MB).
osf_file_download(osf_id = "pngda",
download_to = ".",
max_file_size = 1,
max_download_size = 10)Starting retrieval for pngda
- omitting metacheck.png (1.5MB)
Downloading files [=====================] 24/24 00:00:35
list.files("pngda", recursive = TRUE)#> [1] "Data/Individual/data-01.csv"
#> [2] "Data/Individual/data-02.csv"
#> [3] "Data/Individual/data-03.csv"
#> [4] "Data/Individual/data-04.csv"
#> [5] "Data/Individual/data-05.csv"
#> [6] "Data/Individual/data-06.csv"
#> [7] "Data/Individual/data-07.csv"
#> [8] "Data/Individual/data-08.csv"
#> [9] "Data/Individual/data-09.csv"
#> [10] "Data/Individual/data-10.csv"
#> [11] "Data/Individual/data-11.csv"
#> [12] "Data/Individual/data-12.csv"
#> [13] "Data/Individual/data-13.csv"
#> [14] "Data/Individual/data-14.csv"
#> [15] "Data/Processed Data/processed-data.csv"
#> [16] "Data/Raw Data/data.xlsx"
#> [17] "Data/Raw Data/nest-1/nest-2/nest-3/nest-4/test-4.txt"
#> [18] "Data/Raw Data/nest-1/nest-2/nest-3/test-3.txt"
#> [19] "Data/Raw Data/nest-1/nest-2/test-2.txt"
#> [20] "Data/Raw Data/nest-1/README"
#> [21] "Data/Raw Data/nest-1/test-1.txt"
#> [22] "Data/Raw Data/README"
#> [23] "README"
GitHub, ResearchBox, and Zenodo
The same pattern — find links, retrieve file lists, optionally download — applies to the other three services.
# GitHub
gh_links <- github_links(paper)
gh_files <- github_files(gh_links$href, recursive = TRUE)
# ResearchBox
rb_links <- rbox_links(paper)
rb_info <- rbox_info(rb_links)
# Zenodo
z_links <- zenodo_links(paper)
z_info <- zenodo_info(z_links)See the GitHub article for a detailed walkthrough of the GitHub functions.
Local files
If files are not in an online repository — for example because you downloaded a zip archive from a reviewer submission, or because the authors used a service not yet supported — you can point metacheck at a local folder instead.
result <- module_run(test_paper(), "code_check",
local_path = "path/to/downloaded/files")See the Local Files article for full details, including how to handle cloud-synced folders and multiple paths at once.
Modules
metacheck is designed modularly, so you can add modules to check for anything. It comes with a set of pre-defined modules, and we hope people will share more modules.
Module List
You can see the list of built-in modules with the function below.
#>
#> *** GENERAL ***
#>
#> * all_urls: List all the URLs in the main text.
#> * coi_check: Identify and extract Conflicts of Interest (COI) statements.
#> * coi_check_oi: Identify and extract Conflicts of Interest (COI) statements.
#> * funding_check: Identify and extract funding statements.
#> * funding_check_oi: Identify and extract funding statements.
#> * open_practices: This module searches for open data, code, materials, and registration statements.
#>
#> *** METHOD ***
#>
#> * causal_claims: Aims to identify the presence of random assignment, and lists sentences that make causal claims in title or abstract.
#> * power: This module uses uses regular expressions to identify sentences that contain a statistical power analysis. If specified by the user, it also uses a large language module (LLM) to extract information reported in power analyses, including the statistical test, sample size, alpha level, desired level of power, and magnitude and type of effect size.
#> * prereg_check: Retrieve information from preregistrations in a standardised way,
#> and make them easier to check.
#>
#> *** RESULTS ***
#>
#> * all_p_values: List all p-values in the text, returning the matched text (e.g., 'p = 0.04') and document location in a table.
#> * code_check: This module retrieves information from repositories checked by repo_check about code files (R, SAS, SPSS, Stata).
#> * marginal: List all sentences that describe an effect as 'marginally significant'.
#> * repo_check: This module retrieves information from repositories.
#> * stat_check: Check consistency of p-values and test statistics
#> * stat_effect_size: The Effect Size module checks for effect sizes in t-tests and F-tests.
#> * stat_p_exact: List any p-values reported with insufficient precision (e.g., p < .05 or p = n.s.) or reported as exactly zero (e.g., p = .000).
#> * stat_p_nonsig: This module checks for imprecisely reported p values. If p > .05 is detected, it warns for misinterpretations.
#>
#> *** REFERENCE ***
#>
#> * ref_accuracy: This module checks references for mismatches with CrossRef.
#> * ref_consistency: Check if all references are cited and all citations are referenced
#> * ref_miscitation: Check for frequently miscited papers. This module is just a proof of concept -- the miscite database is not yet populated with real examples.
#> * ref_pubpeer: This module checks references and warns for citations that have comments on pubpeer (excluding Statcheck comments).
#> * ref_replication: This module checks references and warns for citations of original studies for which replication or reproduction studies exist in the FLoRA database.
#> * ref_retraction: This module checks references and warns for citations in the RetractionWatch Database.
#> * ref_summary: Summarise information about each reference in a paper.
#>
#> Use `module_help("module_name")` for help with a specific module
Running modules
To run a built-in module on a paper, you can reference it by name.
p <- module_run(paper, "all_p_values")| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type | p_comp | p_value |
|---|---|---|---|---|---|---|---|---|---|---|
| p = 0.005 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method | = | 0.005 |
| p =0.152 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method | = | 0.152 |
| p > .05 | 17 | 6 | 3 | NA | There was no effect of experience on the reduction in errors when using the tool (p > .05), as the correlation was non-significant (Figure 2). | to_err_is_human | Procedure | method | > | 0.050 |
Creating modules
You can create your own modules using R code. Modules can also contain instructions for reporting, to give “traffic lights” for whether a check passed or failed, and to include appropriate text feedback in a report. See the modules vignette for more details.
Reports
You can generate a report from any set of modules. Check the function help for the default set.
report(paper, output_format = "qmd")See the example report.
