4 Text Search
paper <- read(demofile("json"))
Most Metacheck modules are built on top of one core operation: searching the text of a paper for sentences that match a pattern. The text_search() function gives you direct access to that operation, so you can explore a paper, prototype a new check, or build a corpus of sentences to code by hand.
We use a single demo paper for the small examples, and the built-in psychsci dataset of 250 papers for the corpus-scale examples.
4.1 Searching a paper
You can access a parsed table of the full text of the paper via paper$text, but it is often more convenient to use text_search(). By default it returns a table with one row per sentence, along with identifiers for the sentence (text_id), paragraph (paragraph_id), and section (section_id), the paper ID, the header, and the section type. (The section type is a best guess based on the headers, so it may not always be accurate.)
text <- text_search(paper)

| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. | 1 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| All data are simulated. | 2 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations. | 3 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional. | 4 | 2 | 1 | NA | Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional. | to_err_is_human | [div-01] | intro |
| From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020). | 5 | 2 | 1 | NA | NA | to_err_is_human | [div-01] | intro |
| Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections. | 6 | 2 | 1 | NA | NA | to_err_is_human | [div-01] | intro |
4.2 Pattern
You can search for a specific word or phrase by setting the pattern argument. The pattern is a regex string by default; set fixed = TRUE if you want to find exact text matches.
text <- text_search(paper, pattern = "metacheck")

| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. | 1 | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| In this study we examine the usefulness of metacheck to improve best practices. | 7 | 2 | 1 | NA | NA | to_err_is_human | [div-01] | intro |
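To see why the fixed setting matters, here is a minimal sketch in base R using grepl(), which has a fixed argument of its own. The sentences are made up for illustration, and the sketch only demonstrates regex versus literal matching in general; it does not call text_search() itself.

```r
# Made-up sentences for illustration only
sentences <- c(
  "The effect was significant (p < .05).",
  "The effect was significant (p < 0.05).",
  "The effect was significant (p < 105)."
)

pattern <- "p < .05"

# As a regex, "." matches any single character, so "p < 105" also matches
as_regex <- grepl(pattern, sentences)

# With fixed = TRUE, "." only matches a literal full stop
as_fixed <- grepl(pattern, sentences, fixed = TRUE)

data.frame(sentences, as_regex, as_fixed)
```

If a pattern contains characters such as ".", "(" or "?", either escape them in the regex or use fixed matching.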
4.3 Return
Set return to one of “sentence”, “paragraph”, “section”, or “match” to control what gets returned.
text <- text_search(paper, "GitHub",
  return = "paragraph")

| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| This paper demonstrates some good and poor practices for use with the {metacheck} R package. All data are simulated. The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations. | NA | 1 | 0 | NA | NA | to_err_is_human | Abstract | abstract |
| Data and analysis code is available on GitHub from https://github.com/Lak-ens/to_err_is_human and from https://researchbox.org/4377. | NA | 8 | 5 | NA | NA | to_err_is_human | Data Availability | availability |
4.3.1 Regex matches
You can also return just the matched text from a regex search by setting return = "match". The extra ... arguments in text_search() are passed to grep(), so perl = TRUE allows you to use more complex regex, like below.
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- text_search(paper, pattern, return = "match", perl = TRUE)

| text | text_id | paragraph_id | section_id | page_number | formatted | paper_id | header | section_type |
|---|---|---|---|---|---|---|---|---|
| M = 9.12 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| M = 10.9 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| t(97.7) = 2.9 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| p = 0.005 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| d =0.59 | 15 | 4 | 3 | NA | NA | to_err_is_human | Procedure | method |
| M = 5.06 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| M = 4.5 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| t(97.2) = -1.96 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| p =0.152 | 16 | 5 | 3 | NA | NA | to_err_is_human | Procedure | method |
| N = 50 | 24 | 11 | 7 | NA | NA | to_err_is_human | Power Analysis | annex |
| pwr::pwr.t.test(n = 50 | 27 | 14 | 8 | NA | NA | to_err_is_human | Results | annex |
| power = 0.8 | 27 | 14 | 8 | NA | NA | to_err_is_human | Results | annex |
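To check what a pattern like this captures before running it on a paper, you can try it on a single handmade sentence. The sketch below uses base R's gregexpr() and regmatches() rather than text_search(), and the sentence is invented for illustration.

```r
# Invented sentence containing three reported statistics
sentence <- "The difference was significant, t(97.7) = 2.9, p = 0.005, d =0.59."

# Same pattern as above: a label, then "=" or "<", then a number
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"

# Locate every match, then pull out the matched substrings
found <- gregexpr(pattern, sentence, perl = TRUE)
matches <- regmatches(sentence, found)[[1]]

matches
```

Each element of matches corresponds to one row that a match-level search would return for this sentence.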
4.4 Expand Text
You can expand the text returned by text_search() or a module with text_expand(). This is useful when a matched sentence only makes sense together with the sentences around it.
marginal <- text_search(paper, "marginal") |>
  text_expand(paper, plus = 1, minus = 1)
marginal[, c("text", "expanded")]
#> # A tibble: 2 × 2
#> text expanded
#> <chr> <chr>
#> 1 "The paper shows examples of (1) open and closed OSF links; (2a) cit… "All da…
#> 2 "On average researchers in the experimental condition found the app … "On ave…
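The idea behind expanding by one sentence in each direction can be sketched in base R. The helper expand_context() below is hypothetical and written only for this illustration; it is not how text_expand() is implemented, and the sentences are made up.

```r
# Made-up sentences standing in for one paragraph of a paper
sentences <- c(
  "All data are simulated.",
  "The difference was marginally significant.",
  "We interpret this result with caution.",
  "Participants were then debriefed."
)

# Hypothetical helper: paste each hit together with its neighbours
expand_context <- function(sentences, hits, plus = 0, minus = 0) {
  vapply(hits, function(i) {
    first <- max(1, i - minus)               # do not run off the start
    last <- min(length(sentences), i + plus) # or off the end
    paste(sentences[first:last], collapse = " ")
  }, character(1))
}

hits <- grep("marginal", sentences)
expanded <- expand_context(sentences, hits, plus = 1, minus = 1)
expanded
```

Here the single hit is returned together with the sentence before and the sentence after it.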
4.5 Refining a search across a corpus
When you want to find every sentence about a concept across many papers — for example, to study how often a practice appears, or to build a set of sentences to code by hand — you usually develop the search iteratively: start broad, look at what you catch, then tighten the pattern to remove false positives. The example below uses the full psychsci dataset to find sentences about statistical power.
4.5.1 Start with a fixed term
The most specific search is a fixed phrase:
power_analysis <- text_search(psychsci, "power analysis")
nrow(power_analysis)
#> [1] 102
This is precise but probably misses sentences that discuss power without using that exact phrase. Broadening to just “power” catches far more:
power <- text_search(psychsci, "power")
nrow(power)
#> [1] 767
4.5.2 Tighten with regex
Skimming the broad results shows false positives like “powerful” and “PowerPoint”, which never describe a power analysis. A negative lookahead (which needs perl = TRUE) excludes them while keeping “power” and “powered”:
power_specific <- text_search(psychsci, "power(?!ful|point)",
  perl = TRUE, ignore.case = TRUE)
nrow(power_specific)
#> [1] 696
Before trusting a pattern, it is worth testing it on a few handmade examples.
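One way to run such a test is with base R's grepl(), pairing each handmade sentence with the result you expect. This is a minimal sketch; the sentences are invented, and it checks the regex on its own rather than through text_search().

```r
pattern <- "power(?!ful|point)"

# Invented test sentences, each named vector entry holds the expected result
expected <- c(
  "We conducted an a priori power analysis." = TRUE,
  "The study was underpowered." = TRUE,
  "Statistical power was 80%." = TRUE,
  "Attention is a powerful filter." = FALSE,
  "The slides were made in PowerPoint." = FALSE
)

matched <- grepl(pattern, names(expected), perl = TRUE, ignore.case = TRUE)

# Any row where expected and matched differ points to a problem in the pattern
data.frame(text = names(expected), expected = unname(expected), matched)
```

Adding new sentences to the list whenever you spot a surprising hit in real papers gives you a small regression test for the pattern.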
4.5.3 Chain searches to exclude irrelevant sentences
Piping one text_search() into another lets you progressively narrow the set. A whole-sentence negative lookahead drops sentences about a different sense of the word — here, “power” in the sense of EEG oscillation power:
power_clean <- text_search(psychsci, "power", ignore.case = TRUE) |>
  text_search("^(?!.*oscillat).*$", perl = TRUE)
nrow(power_clean)
#> [1] 754
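The two-step logic can be tried on handmade sentences with base R's grepl(). The sketch below is an illustration of the regex only, with invented sentences; it does not use text_search().

```r
# Invented sentences for illustration
sentences <- c(
  "Statistical power was 80% for the main effect.",
  "Theta oscillatory power increased after the cue.",
  "The power of the oscillation was not manipulated.",
  "Participants rated each face."
)

# Step 1: keep sentences that mention "power"
has_power <- sentences[grepl("power", sentences, ignore.case = TRUE)]

# Step 2: keep only sentences with no "oscillat" anywhere in them.
# The lookahead is anchored at ^, so it scans the whole sentence.
no_oscillation <- has_power[grepl("^(?!.*oscillat).*$", has_power, perl = TRUE)]

no_oscillation
```

Note that in grepl() the second pattern is case-sensitive unless ignore.case = TRUE is also set, so a sentence starting with "Oscillatory" would pass this filter.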
4.5.4 Check what you excluded
When you make a pattern more specific, use dplyr::anti_join() to see which sentences dropped out, so you can confirm you are not losing relevant text (false negatives):
| paper_id | text |
|---|---|
| 09567976211001317 | A wide range of findings has linked modulations of oscillatory power, phase, and frequency to various cognitive functions, such as attention, language, and memory (Wang, 2010). |
| 09567976211001317 | Critically, this assumption underlay their interpretation of the bandlimited difference between the responses to expected and unexpected outcomes: Given that theta entrainment took place, the observed difference in oscillatory power must have reflected an effect on infant theta oscillations, in line with theta’s postulated sensitivity to violations of expectations. |
| 09567976211001317 | Here, we argue that both the assumption of entrainment in the first place and the consequent interpretation of band-limited power differences as modulations of entrained oscillations could be versions of the Fourier fallacy ( Jasper, 1948), that is, premature interpretations of frequency-domain effects in terms of oscillatory activity. |
| 09567976211001317 | Additionally, impulse-like event-related potentials (ERPs) manifest as low-frequency EEG power transients, despite arising from signals that may or may not be related to neural oscillations (Herrmann et al., 2005). |
| 09567976211001317 | At no point was the power of an “oscillation” manipulated. |
| 09567976211001317 | Although our simulation did not involve modulating oscillatory (SSVEP) power, our results bear close qualitative similarity with the effects reported by Köster et al. |
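A comparison like this could be produced along the following lines. This is a sketch on a small made-up data frame, assuming the dplyr package is installed and that paper_id and text_id together identify a sentence; the paper IDs and sentences are hypothetical.

```r
library(dplyr)

# Made-up broad search results (hypothetical paper IDs and sentences)
broad <- data.frame(
  paper_id = c("paper_a", "paper_a", "paper_b"),
  text_id = c(3, 9, 4),
  text = c(
    "Statistical power was 80% for the main effect.",
    "Theta oscillatory power increased after the cue.",
    "We ran a power analysis before data collection."
  )
)

# Narrower result set: the same rows minus sentences about oscillations
narrow <- broad[!grepl("oscillat", broad$text), ]

# Rows in the broad set that have no partner in the narrow set
excluded <- anti_join(broad, narrow, by = c("paper_id", "text_id"))

excluded[, c("paper_id", "text")]
```

Reading through the excluded rows is the quickest way to notice relevant sentences that the tighter pattern has dropped.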
4.5.5 Find outlier papers
A paper that mentions a term unusually often may be using it in a different sense. Counting matches per paper helps you spot these:
| paper_id | n |
|---|---|
| 0956797616647519 | 97 |
| 09567976211001317 | 26 |
| 09567976241254312 | 19 |
| 09567976211007788 | 17 |
| 0956797620951115 | 13 |
| 09567976211028978 | 13 |
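A per-paper count of this kind can be sketched with dplyr::count(). The example assumes dplyr is installed and uses a made-up results table with hypothetical paper IDs in place of real search output.

```r
library(dplyr)

# Made-up search results: one row per matching sentence
hits <- data.frame(
  paper_id = c("paper_a", "paper_b", "paper_a", "paper_c", "paper_a", "paper_b")
)

# Count matching sentences per paper, most frequent first
per_paper <- count(hits, paper_id, sort = TRUE)

per_paper
```

Papers at the top of the list are the ones to read first when checking whether the term is used in the sense you intend.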
You can repeat this cycle — broaden, inspect, tighten, check exclusions — until you are satisfied that the pattern catches the relevant text without too much noise. This iterative search is the foundation for writing your own module: once you have a pattern you trust, you can wrap it in a module and validate it against hand-coded ground truth.
4.6 Querying with an LLM
For questions that a regular expression cannot answer — for example, “does this sentence describe an a priori power analysis?” — you can pass the searched text to a large language model. Narrow the text with text_search() first, then call llm(). See the Using Large Language Models chapter for details.
