4 Text Search

Most Metacheck modules are built on top of one core operation: searching the text of a paper for sentences that match a pattern. The text_search() function gives you direct access to that operation, so you can explore a paper, prototype a new check, or build a corpus of sentences to code by hand.

We use a single paper from the built-in psychsci dataset for the small examples, and the whole dataset of 250 papers for the corpus-scale examples.

library(metacheck)
paper <- read(demofile("json"))

4.1 Searching a paper

You can access a parsed table of the full text of the paper via paper$text, but you may find it more convenient to use the function text_search(). By default it returns a data table with one row per sentence, together with the text, paragraph, and section IDs, the page number, the paper ID, the section header, and the section type. (The section type is a best guess from the headers, so it may not always be accurate.)

text <- text_search(paper)

text	text_id	paragraph_id	section_id	page_number	formatted	paper_id	header	section_type
This paper demonstrates some good and poor practices for use with the {metacheck} R package.	1	1	0	NA	NA	to_err_is_human	Abstract	abstract
All data are simulated.	2	1	0	NA	NA	to_err_is_human	Abstract	abstract
The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations.	3	1	0	NA	NA	to_err_is_human	Abstract	abstract
Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional.	4	2	1	NA	Although intentional dishonesty might be a successful way to boost creativity (Gino and Wiltermuth 2014), it is safe to say most mistakes researchers make are unintentional.	to_err_is_human	[div-01]	intro
From a human factors perspective, human error is a symptom of a poor design (Smithy, 2020).	5	2	1	NA	NA	to_err_is_human	[div-01]	intro
Automation can be used to check for errors in scientific manuscripts, and inform authors about possible corrections.	6	2	1	NA	NA	to_err_is_human	[div-01]	intro

4.2 Pattern

You can search for a specific word or phrase by setting the pattern argument. The pattern is a regex string by default; set fixed = TRUE if you want to find exact text matches.

text <- text_search(paper, pattern = "metacheck")

text	text_id	paragraph_id	section_id	page_number	formatted	paper_id	header	section_type
This paper demonstrates some good and poor practices for use with the {metacheck} R package.	1	1	0	NA	NA	to_err_is_human	Abstract	abstract
In this study we examine the usefulness of metacheck to improve best practices.	7	2	1	NA	NA	to_err_is_human	[div-01]	intro
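The difference between regex and fixed matching is easiest to see on a small example. The sketch below uses base R's grepl() on made-up sentences (they are not from the demo paper) rather than text_search() itself, so it only illustrates the matching rules, on the assumption that text_search() applies them in the same way.

```r
# Hypothetical sentences, not taken from the demo paper
sentences <- c(
  "The effect was significant (p < .05).",
  "The effect was significant (p < 0.05).",
  "We sampled 1005 participants."
)

# As a regex, "." matches any single character, so ".05" also matches "005"
regex_hits <- grepl(".05", sentences)

# With fixed = TRUE, ".05" only matches a literal dot followed by "05"
fixed_hits <- grepl(".05", sentences, fixed = TRUE)

data.frame(sentences, regex_hits, fixed_hits)
```

The third sentence is a false positive under regex matching only, which is the kind of surprise that fixed = TRUE (or escaping the dot as "\\.") avoids.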

4.3 Return

Set return to one of “sentence” (the default), “paragraph”, “section”, or “match” to control what gets returned.

text <- text_search(paper, "GitHub",
                    return = "paragraph")

text	text_id	paragraph_id	section_id	page_number	formatted	paper_id	header	section_type
This paper demonstrates some good and poor practices for use with the {metacheck} R package. All data are simulated. The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations.	NA	1	0	NA	NA	to_err_is_human	Abstract	abstract
Data and analysis code is available on GitHub from https://github.com/Lak-ens/to_err_is_human and from https://researchbox.org/4377.	NA	8	5	NA	NA	to_err_is_human	Data Availability	availability
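Conceptually, a paragraph-level return means that any paragraph containing a matching sentence comes back whole. The base R sketch below mimics this on a made-up sentence table. It illustrates the idea only and is not metacheck's actual implementation; the data frame and variable names are hypothetical.

```r
# Hypothetical sentence table with the paragraph each sentence belongs to
sentences <- data.frame(
  text = c("All data are simulated.",
           "Code is available on GitHub.",
           "We thank the reviewers."),
  paragraph_id = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Paragraphs that contain at least one matching sentence
hit_ids <- unique(sentences$paragraph_id[grepl("GitHub", sentences$text)])
in_hit <- sentences$paragraph_id %in% hit_ids

# Paste the sentences of each matching paragraph back together
paragraphs <- tapply(sentences$text[in_hit],
                     sentences$paragraph_id[in_hit],
                     paste, collapse = " ")
paragraphs
```

Only paragraph 1 is returned, and it includes the sentence that did not itself match.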

4.3.1 Regex matches

You can also return just the matched text from a regex search by setting return = "match". The extra ... arguments in text_search() are passed to grep(), so you can set perl = TRUE to use Perl-compatible regular expressions, as in the example below.

pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"
text <- text_search(paper, pattern, return = "match", perl = TRUE)

text	text_id	paragraph_id	section_id	page_number	formatted	paper_id	header	section_type
M = 9.12	15	4	3	NA	NA	to_err_is_human	Procedure	method
M = 10.9	15	4	3	NA	NA	to_err_is_human	Procedure	method
t(97.7) = 2.9	15	4	3	NA	NA	to_err_is_human	Procedure	method
p = 0.005	15	4	3	NA	NA	to_err_is_human	Procedure	method
d =0.59	15	4	3	NA	NA	to_err_is_human	Procedure	method
M = 5.06	16	5	3	NA	NA	to_err_is_human	Procedure	method
M = 4.5	16	5	3	NA	NA	to_err_is_human	Procedure	method
t(97.2) = -1.96	16	5	3	NA	NA	to_err_is_human	Procedure	method
p =0.152	16	5	3	NA	NA	to_err_is_human	Procedure	method
N = 50	24	11	7	NA	NA	to_err_is_human	Power Analysis	annex
pwr::pwr.t.test(n = 50	27	14	8	NA	NA	to_err_is_human	Results	annex
power = 0.8	27	14	8	NA	NA	to_err_is_human	Results	annex
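To see how a single sentence can yield several matches, the same pattern can be tried on a string with base R's gregexpr() and regmatches(). This is a sketch on a made-up sentence, not a description of how text_search() extracts matches internally.

```r
pattern <- "[a-zA-Z]\\S*\\s*(=|<)\\s*[0-9\\.,-]*\\d"

# Hypothetical sentence reporting three statistics
sentence <- "The groups differed, t(97.7) = 2.9, p = 0.005, d =0.59."

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(sentence, gregexpr(pattern, sentence, perl = TRUE))[[1]]
matches
```

Each match ends on a digit because the pattern finishes with \\d, so trailing commas and full stops are left out.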

4.4 Expand Text

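You can expand the text returned by text_search() or a module with text_expand(). This is useful when a matched sentence only makes sense together with the sentences around it. In the example below, minus = 1 and plus = 1 add one sentence before and one sentence after each match.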

marginal <- text_search(paper, "marginal") |>
  text_expand(paper, plus = 1, minus = 1)

marginal[, c("text", "expanded")]

#> # A tibble: 2 × 2
#>   text                                                                  expanded
#>   <chr>                                                                 <chr>   
#> 1 "The paper shows examples of (1) open and closed OSF links; (2a) cit… "All da…
#> 2 "On average researchers in the experimental condition found the app … "On ave…

4.5 Refining a search across a corpus

When you want to find every sentence about a concept across many papers (for example, to study how often a practice appears, or to build a set of sentences to code by hand), you usually develop the search iteratively: start with a specific term, broaden it, look at what you catch, then tighten the pattern to remove false positives. The example below uses the full psychsci dataset to find sentences about statistical power.

4.5.1 Start with a fixed term

The most specific search is a fixed phrase:

power_analysis <- text_search(psychsci, "power analysis")
nrow(power_analysis)

#> [1] 102

This is precise but probably misses sentences that discuss power without using that exact phrase. Broadening to just “power” catches far more:

power <- text_search(psychsci, "power")
nrow(power)

#> [1] 767

4.5.2 Tighten with regex

Skimming the broad results shows false positives like “powerful” and “PowerPoint”, which never describe a power analysis. A negative lookahead (which needs perl = TRUE) excludes them while keeping “power” and “powered”:

power_specific <- text_search(psychsci, "power(?!ful|point)",
                              perl = TRUE, ignore.case = TRUE)
nrow(power_specific)

#> [1] 696

Before trusting a pattern, it is worth testing it on a few handmade examples:

pattern <- "power(?!ful|point)"
yes <- c("power", "power analysis", "powered")
no  <- c("powerful", "PowerPoint")
grepl(pattern, yes, perl = TRUE, ignore.case = TRUE)

#> [1] TRUE TRUE TRUE

grepl(pattern, no,  perl = TRUE, ignore.case = TRUE)

#> [1] FALSE FALSE
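If you test patterns often, the two checks can be wrapped in a small helper. The function below is a hypothetical convenience (check_pattern() is not part of metacheck) that returns TRUE only when the pattern matches every "yes" example and none of the "no" examples.

```r
# Hypothetical helper, not part of metacheck: extra arguments go to grepl()
check_pattern <- function(pattern, yes, no, ...) {
  all(grepl(pattern, yes, ...)) && !any(grepl(pattern, no, ...))
}

yes <- c("power", "power analysis", "powered")
no  <- c("powerful", "PowerPoint")

# The lookahead pattern passes both checks
ok_specific <- check_pattern("power(?!ful|point)", yes, no,
                             perl = TRUE, ignore.case = TRUE)

# The broad pattern fails because it also matches the "no" examples
ok_broad <- check_pattern("power", yes, no, ignore.case = TRUE)

c(specific = ok_specific, broad = ok_broad)
```

Adding each newly discovered false positive to the "no" vector keeps a record of the cases the pattern is meant to handle.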

4.5.3 Chain searches to exclude irrelevant sentences

Piping one text_search() into another lets you progressively narrow the set. The pattern ^(?!.*oscillat).*$ is a whole-sentence negative lookahead: it matches only sentences that do not contain “oscillat” anywhere, which drops sentences about a different sense of the word (here, “power” in the sense of EEG oscillation power):

power_clean <- text_search(psychsci, "power", ignore.case = TRUE) |>
  text_search("^(?!.*oscillat).*$", perl = TRUE)
nrow(power_clean)

#> [1] 754
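The effect of the whole-sentence lookahead can be checked on a few made-up sentences with base R's grepl(). This sketch only demonstrates the regex; the sentences are hypothetical and the two-step filter stands in for the chained text_search() calls.

```r
# Hypothetical sentences, not taken from psychsci
sentences <- c(
  "A power analysis indicated a required sample of 50.",
  "We observed a difference in oscillatory power.",
  "The study was powered to detect d = 0.5."
)

# Step 1: keep sentences that mention "power"
has_power <- grepl("power", sentences, ignore.case = TRUE)

# Step 2: the lookahead fails at the start of any sentence containing "oscillat"
no_oscillat <- grepl("^(?!.*oscillat).*$", sentences, perl = TRUE)

kept <- sentences[has_power & no_oscillat]
kept
```

For a single excluded word, this is equivalent to negating a plain search, as in !grepl("oscillat", sentences); the lookahead form is useful because it can be passed as an ordinary pattern.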

4.5.4 Check what you excluded

When you make a pattern more specific, use dplyr::anti_join() to see which sentences dropped out, so you can confirm you are not losing relevant text (false negatives):

excluded <- dplyr::anti_join(power, power_clean,
                             by = c("paper_id", "text_id"))
head(excluded[, c("paper_id", "text")]) |>
  knitr::kable()

paper_id	text
09567976211001317	A wide range of findings has linked modulations of oscillatory power, phase, and frequency to various cognitive functions, such as attention, language, and memory (Wang, 2010).
09567976211001317	Critically, this assumption underlay their interpretation of the bandlimited difference between the responses to expected and unexpected outcomes: Given that theta entrainment took place, the observed difference in oscillatory power must have reflected an effect on infant theta oscillations, in line with theta’s postulated sensitivity to violations of expectations.
09567976211001317	Here, we argue that both the assumption of entrainment in the first place and the consequent interpretation of band-limited power differences as modulations of entrained oscillations could be versions of the Fourier fallacy ( Jasper, 1948), that is, premature interpretations of frequency-domain effects in terms of oscillatory activity.
09567976211001317	Additionally, impulse-like event-related potentials (ERPs) manifest as low-frequency EEG power transients, despite arising from signals that may or may not be related to neural oscillations (Herrmann et al., 2005).
09567976211001317	At no point was the power of an “oscillation” manipulated.
09567976211001317	Although our simulation did not involve modulating oscillatory (SSVEP) power, our results bear close qualitative similarity with the effects reported by Köster et al.
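If dplyr is not available, the same comparison can be approximated in base R by building a key from the identifying columns. The sketch uses small made-up tables with the same paper_id and text_id columns as the search results; the key() helper is hypothetical.

```r
# Hypothetical broad result set
broad <- data.frame(
  paper_id = c("a", "a", "b"),
  text_id  = c(1, 2, 1),
  text     = c("A power analysis was run.",
               "Oscillatory power increased.",
               "The study was well powered."),
  stringsAsFactors = FALSE
)

# Hypothetical narrowed result set: rows 1 and 3 survived the stricter search
narrow <- broad[c(1, 3), ]

# Hypothetical helper: combine the ID columns into one comparable key
key <- function(d) paste(d$paper_id, d$text_id, sep = "|")

# Rows of the broad set whose key does not appear in the narrow set
excluded <- broad[!(key(broad) %in% key(narrow)), ]
excluded
```

Both ID columns are needed in the key because text_id values repeat across papers.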

4.5.5 Find outlier papers

A paper that mentions a term unusually often may be using it in a different sense. Counting matches per paper helps you spot these:

dplyr::count(power, paper_id, sort = TRUE) |>
  dplyr::filter(n > 10) |>
  head() |>
  knitr::kable()

paper_id	n
0956797616647519	97
09567976211001317	26
09567976241254312	19
09567976211007788	17
0956797620951115	13
09567976211028978	13
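The same per-paper count can be sketched in base R with table(). The paper IDs below are made up, and the threshold of 1 is an arbitrary choice for this toy data, not a recommended cutoff.

```r
# Hypothetical paper IDs, one entry per matching sentence
paper_ids <- c("p1", "p2", "p1", "p3", "p1", "p2")

# Count matches per paper and sort from most to fewest
counts <- sort(table(paper_ids), decreasing = TRUE)

# Keep only papers above the (arbitrary) threshold
frequent <- counts[counts > 1]
frequent
```

Reading a handful of sentences from the papers at the top of such a list is usually enough to tell whether they use the term in the intended sense.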

You can repeat this cycle — broaden, inspect, tighten, check exclusions — until you are satisfied that the pattern catches the relevant text without too much noise. This iterative search is the foundation for writing your own module: once you have a pattern you trust, you can wrap it in a module and validate it against hand-coded ground truth.

4.6 Querying with an LLM

For questions that a regular expression cannot answer — for example, “does this sentence describe an a priori power analysis?” — you can pass the searched text to a large language model. Narrow the text with text_search() first, then call llm(). See the Using Large Language Models chapter for details.