5 Using Large Language Models

Most of Metacheck works entirely with regular expressions and R code, and needs no AI at all. A few modules can do more when a large language model (LLM) is available. The most notable are the power module, which can read sentences about power analyses and extract structured details (test, sample size, alpha, power, effect size), and the prereg_check module. You can also use an LLM directly to categorize information retrieved through the text search, or to extract information from papers.

This chapter explains how LLM support works in Metacheck: how to turn it on, how to choose a model, how to run a model locally with Ollama or through a cloud API, and how to query papers directly with llm().

Metacheck’s philosophy on AI

LLM use is entirely optional and opt-in. Metacheck restricts LLMs to classifying and extracting information from existing text, never to evaluating quality. We prioritise non-LLM methods, and the vast majority of modules work without any LLM. When you do use one, the recommended setup runs the model locally so no data leaves your computer.

Because the code in this chapter needs an LLM to be installed and configured, the examples below are shown but not executed when this book is built. You can run all of them yourself once you have set up either Ollama or a cloud API key, as described below.

5.1 Turning LLM support on and off

LLM support is controlled by a single global switch, llm_use(). It is off by default. Modules that can use an LLM check this switch and fall back to their regular-expression behaviour when it is off.

llm_use()        # check the current setting (FALSE by default)
llm_use(TRUE)    # turn LLM support on
llm_use(FALSE)   # turn it back off

So the same module call behaves differently depending on this switch:

paper <- demopaper()

llm_use(FALSE)
module_run(paper, "power")   # regex only: detects and classifies power statements

llm_use(TRUE)
module_run(paper, "power")   # also extracts structured details with the LLM
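The fallback behaviour can be pictured with a small base-R sketch. This is not Metacheck's implementation: `make_switch()`, `use_llm_demo()` and `describe_power()` are hypothetical names, used only to illustrate a global switch that a function consults before doing optional extra work.

```r
# Hypothetical stand-in for a global on/off switch, stored in a closure.
make_switch <- function(default = FALSE) {
  state <- default
  function(value = NULL) {
    if (!is.null(value)) state <<- isTRUE(value)  # set when a value is given
    state                                         # always report the setting
  }
}
use_llm_demo <- make_switch()

# Hypothetical module body: the regex step always runs,
# the optional step runs only when the switch is on.
describe_power <- function(sentence) {
  found <- grepl("power", sentence, ignore.case = TRUE)
  out <- list(mentions_power = found, details = NULL)
  if (found && use_llm_demo()) {
    out$details <- "(structured details would be requested from the model here)"
  }
  out
}

use_llm_demo(FALSE)
regex_only <- describe_power("We had 80% power to detect d = 0.4.")

use_llm_demo(TRUE)
with_llm <- describe_power("We had 80% power to detect d = 0.4.")
```

With the switch off, `regex_only$details` stays `NULL`; with it on, the same call fills in the extra field.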

5.2 Choosing a model

The model is set with llm_model(). A model is named either as "provider" (use that provider’s default model) or "provider/model" (a specific model).

llm_model()                                 # show the model currently set
llm_model("ollama/qwen2.5:3b")              # a specific local Ollama model
llm_model("groq")                           # the default Groq cloud model
llm_model("groq/llama-3.3-70b-versatile")   # a specific Groq model
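To make the naming convention concrete, here is a minimal sketch of how a "provider" or "provider/model" string could be split. `parse_model_name()` is a hypothetical helper, not a Metacheck function. It splits at the first slash only, on the assumption that a model identifier may itself contain a slash.

```r
# Hypothetical helper: split "provider" or "provider/model" at the first slash.
parse_model_name <- function(name) {
  slash <- as.integer(regexpr("/", name, fixed = TRUE))
  if (slash == -1) {
    # No slash: only a provider was given, so the model is left unspecified.
    return(list(provider = name, model = NA_character_))
  }
  list(
    provider = substr(name, 1, slash - 1),
    model    = substr(name, slash + 1, nchar(name))
  )
}

local_model <- parse_model_name("ollama/qwen2.5:3b")
cloud_default <- parse_model_name("groq")
```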

When Metacheck starts, it checks for known API keys (for example GROQ_API_KEY) in your environment and sets a sensible default model automatically.

5.2.1 Listing available models

llm_model_list() returns the models Metacheck can see across all configured providers. Pass a provider name to narrow it down.

llm_model_list()           # all models from all configured providers
llm_model_list("groq")     # just Groq models
llm_model_list("ollama")   # just the models you have pulled locally

5.2.2 Supported providers

Metacheck uses ellmer under the hood, so it supports any provider ellmer does, plus Groq and Ollama. The available provider platforms are:

Provider	Type	Notes
`ollama`	local	Recommended. Runs on your own machine; no key, no data sent out.
`groq`	cloud	Fast, free tier; needs `GROQ_API_KEY`.
`openai`	cloud	Needs `OPENAI_API_KEY`.
`anthropic` / `claude`	cloud	Needs `ANTHROPIC_API_KEY`.
`google_gemini` / `google_vertex`	cloud	Needs `GOOGLE_API_KEY`.
`mistral`	cloud	Needs a Mistral key.
`aws_bedrock`	cloud	AWS credentials.
`github`	cloud	GitHub Models.
`lmstudio` / `vllm`	local	Other local-serving options.

5.3 Option 1: Run a model locally with Ollama (recommended)

Ollama is a free, open-source tool that runs AI models on your own computer. It needs no API key and no account, has no usage costs, and sends no data off your machine. The trade-off is speed: your computer does the work, so it is slower than a cloud API, especially the first time a model loads.

Setup, in brief:

1. Download and install Ollama from https://ollama.com/download.
2. Pull a model from a terminal (not R). A small, capable starting model:
   ```
   ollama pull qwen2.5:3b
   ```
   Other useful models: llama3.2 (a good general default) and qwen2.5:7b (more capable, needs more memory). You can see everything you have pulled with ollama list.
3. Point Metacheck at it in R:

llm_use(TRUE)
llm_model("ollama/qwen2.5:3b")

For a full walkthrough — system requirements, verifying the server is running, choosing a model size for your hardware, and troubleshooting — see the dedicated Local AI with Ollama chapter.

5.4 Option 2: Use a cloud API

You can use any cloud model ellmer supports. This is faster than running locally but requires an account and API key with the provider, and sends text to an external service — keep this in mind for unpublished manuscripts.

Get an API key from your provider (for example https://console.groq.com/keys) and add it to your .Renviron file (open it with usethis::edit_r_environ()):

GROQ_API_KEY="your-key-here"
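R reads .Renviron at startup, so restart your R session after editing the file. A quick way to confirm the key is visible, without printing it, is to check that the environment variable is non-empty. `has_api_key()` below is a hypothetical convenience wrapper around base R's `Sys.getenv()`, not part of Metacheck.

```r
# Hypothetical helper: TRUE if the named environment variable is set and non-empty.
# It returns only a logical, so the key itself never appears in the console.
has_api_key <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

groq_key_found <- has_api_key("GROQ_API_KEY")
groq_key_found
```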

Then enable LLM support and select the model:

llm_use(TRUE)
llm_model("groq")   # or e.g. llm_model("groq/llama-3.3-70b-versatile")

5.5 Querying papers directly with `llm()`

Beyond the modules, you can send any text to an LLM with llm(). Narrow the text down first with text_search() so you only send relevant sentences (this also keeps the number of queries — and any costs — low).

As an example, suppose we want to know whether a paper states any constraints on the generalizability of its findings. A plain text search for “generaliz” finds candidate sentences, but it cannot tell the difference between a sentence about constraints on generality and one that merely mentions a “generalized linear model”. That judgement is exactly the kind of classification an LLM can do. Here we take one paper from psychsci, find every sentence mentioning generalizability, and ask the model a yes/no question about each:

generaliz <- text_search(psychsci[["09567976231222836"]], "generaliz",
                         ignore.case = TRUE)

system_prompt <- "Do the authors state any constraints on the generalizability of their findings? Answer only TRUE or FALSE."

result <- llm(generaliz, system_prompt)

The result adds an answer column to the searched sentences. Of the 15 sentences found in this paper, the model classified 6 as describing constraints on generalizability and 9 as not:

#> 
#> FALSE  TRUE 
#>     9     6
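A tally like this can be produced with base R's `table()`. The sketch below uses a small made-up data frame in place of the real `result`, and assumes the answer column holds free text. Because a model may reply "True." or "false" instead of exactly TRUE or FALSE, the hypothetical helper `normalise_answer()` cleans the replies first and marks anything else as NA.

```r
# Made-up stand-in for the result of llm(); the real sentences and answers differ.
result_demo <- data.frame(
  text   = paste("Sentence", 1:5),
  answer = c("TRUE", "false", " True.", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map free-text replies onto TRUE / FALSE / NA.
normalise_answer <- function(x) {
  cleaned <- gsub("[^A-Z]", "", toupper(trimws(x)))  # keep letters only
  ifelse(cleaned %in% c("TRUE", "FALSE"), cleaned == "TRUE", NA)
}

result_demo$flag <- normalise_answer(result_demo$answer)
counts <- table(result_demo$flag, useNA = "ifany")
flagged <- result_demo$text[which(result_demo$flag)]
```

Here `counts` holds the number of FALSE and TRUE replies, and `flagged` holds the sentences the model answered TRUE to.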

The sentences it flagged are the substantive ones — for example statements that the findings come from a single population, or that “there is no guarantee that evidence from WEIRD populations will generalize to other populations” — while the sentences it answered FALSE to are mostly methodological mentions of “generalized linear mixed models”:

Sentences flagged as constraints on generalizability
Finding that English speakers treat suffixed words as more similar to each other than prefixed words could simply reflect their extensive experience with a language which happens to adhere to the cross-linguistic generalization.
By definition, languages that violate cross-linguistic generalizations are relatively rare, and sometimes extremely rare, and thus accessing participant populations can be challenging.
But there is no guarantee that evidence from WEIRD populations will generalize to other populations.

5.5.1 Free text vs. structured output

By default llm() returns free-text responses in an answer column. For reliable extraction, pass an ellmer type specification so the provider returns structured fields:

type_spec <- ellmer::type_object(
  apriori = ellmer::type_boolean("Whether this is an a priori power analysis"),
  sample  = ellmer::type_integer("The planned sample size", required = FALSE)
)

# power_sentences: a table of candidate sentences, for example from text_search()
result <- llm(power_sentences, "Extract power analysis details.", type = type_spec)

5.5.2 A worked example: extracting power analyses

You can also ask for JSON directly in the prompt. Below we narrow the first ten psychsci papers to sentences that mention “power” and contain a number, then ask the model to return structured details as JSON:

power <- psychsci[1:10] |>
  text_search("power") |>   # sentences containing the word power
  text_search("[0-9]")      # and containing at least one number
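The same two-step narrowing can be sketched in base R, with `grepl()` as a stand-in for text_search(). This rests on an assumption, not a documented fact: that the search matches substrings without regard to case. The sentences are invented for illustration. The sketch shows why such a filter only yields candidates: "powerfully" and "Power-Point" both contain "power", and a word-boundary pattern removes only the first of these.

```r
sentences <- c(
  "A sample size of 26 per group was required to ensure 80% power.",
  "Images were presented for 2 s using Power-Point.",
  "Positive arousal most powerfully promoted lending rates (Tables 1 and 2).",
  "Participants completed the task in a quiet room."
)

has_power  <- grepl("power", sentences, ignore.case = TRUE)  # substring match
has_number <- grepl("[0-9]", sentences)                      # at least one digit
candidates <- sentences[has_power & has_number]

# A stricter whole-word pattern drops "powerfully" but still matches "Power-Point",
# because the hyphen counts as a word boundary.
whole_word <- grepl("\\bpower\\b", sentences, ignore.case = TRUE, perl = TRUE)
strict <- sentences[whole_word & has_number]
```

`candidates` keeps the first three sentences and `strict` keeps the first two, so a classification step is still needed to separate real power analyses from incidental matches.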

system_prompt <- 'Does this sentence report an a priori power analysis? If so, return the test, sample size, critical alpha criterion, power level, effect size and effect size metric plus any other relevant parameters, in JSON format like:

{
  "apriori": true,
  "test": "paired samples t-test",
  "sample": 20,
  "alpha": 0.05,
  "power": 0.8,
  "es": 0.4,
  "es_metric": "cohen\'s D"
}

If not, return {"apriori": false}

Answer only in valid JSON format, starting with { and ending with }.'

llm_power <- llm(power, system_prompt)

json_expand() turns the JSON answer column into proper columns, and handles malformed responses gracefully (setting an error column rather than failing):

llm_response <- json_expand(llm_power, "answer") |>
  dplyr::select(text, apriori:es_metric)

text	apriori	test	sample	alpha	power	es	es_metric
It is possible that less-consistent effects were observed on trials with errors because of reduced power to detect an effect on these trials, which by design were less numerous (~25%).	FALSE	NA	NA	NA	NA	NA	NA
Figure 1 shows that CY had very little predictive power for CLIM, but the fit in the transposed plot has an obvious bell-shaped curve.	FALSE	NA	NA	NA	NA	NA	NA
Sample size was calculated with an a priori power analysis, using the effect sizes reported by Küpper et al. (2014), who used identical procedures, materials, and dependent measures.	TRUE	NA	NA	NA	NA	NA	NA
We determined that a minimum sample size of 7 per group would be necessary for 95% power to detect an effect.	TRUE	t-test	7	0.050	0.95	NA	NA
For the first part of the task, 11 static visual images, one from each of the scenes in the film were presented once each on a black background for 2 s using Power-Point.	FALSE	NA	NA	NA	NA	NA	NA
A sample size of 26 per group was required to ensure 80% power to detect this difference at the 5% significance level.	TRUE	two-sample t-test	26	0.050	0.80	NA	NA
A sample size of 18 per condition was required in order to ensure an 80% power to detect this difference at the 5% significance level.	TRUE	t-test	18	0.050	0.80	NA	NA
The 13,500 selected loan requests conservatively achieved a power of .98 for an effect size of .07 at an alpha level of .05.	TRUE		13500	0.050	0.98	0.07	NA
On the basis of simulations over a range of expected effect sizes for contrasts of fMRI activity, we estimated that a sample size of 24 would provide .80 power at a conservative brainwide alpha threshold of .002 (although such thresholds ideally should be relaxed for detecting activity in regions where an effect is predicted).	TRUE	fMRI activity contrast	24	0.002	0.80	NA	NA
Stimulus sample size was determined via power analysis of the sole existing similar study, which used neural activity to predict Internet downloads of music (Berns & Moore, 2012).	TRUE	NA	NA	NA	NA	NA	NA
The effect size from that study implied that a sample size of 72 loan requests would be required to achieve .80 power at an alpha level of .05.	TRUE		72	0.050	0.80	NA	NA
Categorical ratings of the emotional expressions in the loan photographs had a similarly powerful impact on loan-request success; requests with “happy” photographs received $5.15 more per hour than requests with “sad” photographs, on average; they achieved full funding in 7.6% less time.	FALSE	NA	NA	NA	NA	NA	NA
Although previous research has provided mixed evidence about the impact of positive versus negative affect on charitable giving (Andreoni, 1990;Small & Verrochi, 2009), by simultaneously assessing affect at both Internet-aggregate and laboratory-sample levels of analysis, our studies provide consistent evidence that photograph-elicited positive arousal most powerfully promoted lending rates and outcomes (Tables 1 and 2, Fig. 2a, and Fig.	FALSE	NA	NA	NA	NA	NA	NA
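The idea of recording an error instead of failing on a malformed reply can be sketched as follows. This is not json_expand()'s implementation: `parse_answer()` is a hypothetical helper, the replies are invented, only two fields are handled, and the sketch assumes the jsonlite package is installed.

```r
# Hypothetical helper: parse one reply, returning a one-row data frame.
# A reply that is not valid JSON yields NA fields and error = TRUE.
parse_answer <- function(answer) {
  parsed <- tryCatch(jsonlite::fromJSON(answer), error = function(e) NULL)
  if (!is.list(parsed)) {
    return(data.frame(apriori = NA, sample = NA_integer_, error = TRUE))
  }
  # Fields the model left out are filled with a default.
  field <- function(name, default) {
    if (is.null(parsed[[name]])) default else parsed[[name]]
  }
  data.frame(
    apriori = as.logical(field("apriori", NA)),
    sample  = as.integer(field("sample", NA)),
    error   = FALSE
  )
}

answers <- c(
  '{"apriori": true, "sample": 26}',
  '{"apriori": false}',
  "Sorry, I am not sure."
)
expanded <- do.call(rbind, lapply(answers, parse_answer))
```

Each reply becomes one row, so a single bad response marks its own row and leaves the rest of the table usable.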

5.5.3 Sending a whole paper

So far we have sent individual sentences. You can also send an entire paper in one call by collapsing all of its sentences into a single string. For example, to ask about the sample size of the demo paper:

paper <- read(demofile("json"))
full_text <- paste(paper$text$text, collapse = " ")

llm_use(TRUE)
llm_model("groq")
answer <- llm(full_text, "What was the total sample size in this study? Answer with just the number and a brief justification.")
answer$answer

#> [1] "100. \n\nThe total sample size is 100, as the study randomly assigned 50 scientists to a condition where their manuscript was automatically checked for errors, and 50 scientists to a control condition with a checklist."
#> attr(,"class")
#> [1] "ellmer_output"

Using Groq, this returned in under two seconds.

Why not do this locally with Ollama? A whole paper is a large prompt. The demo paper is only about 1,200 tokens, but a typical journal article is closer to 10,000. Sending that to a local model is impractical for two reasons:

Speed. A local model processes the entire prompt on your own hardware. Where a short sentence takes a second or two, a full paper can take minutes per call — and running it across many papers quickly becomes hours.
Context window. Many local models have a context window of only 4,000–8,000 tokens. A full paper can exceed it, so the model silently sees only part of the text (often missing the discussion and limitations at the end), giving misleading answers.

Cloud providers like Groq have large context windows and process the prompt on their own fast hardware, so a whole-paper query returns in seconds — at the cost of sending the full manuscript to an external service.

This is exactly why Metacheck’s design encourages narrowing the text with text_search() first: smaller, targeted queries are faster, fit inside a local model’s context window, and send less data. Reserve whole-paper prompts for short papers or cloud models.
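Before sending a long prompt, it can help to estimate whether it will fit. The sketch below uses the common rule of thumb of roughly four characters per token for English text. That figure is an approximation, real tokenizers differ by model, and `estimate_tokens()` and `fits_context()` are hypothetical helpers, not Metacheck functions. The `reserve` argument is an assumed allowance for the system prompt and the reply.

```r
# Rough token estimate: about 4 characters per token (approximation only).
estimate_tokens <- function(text, chars_per_token = 4) {
  ceiling(sum(nchar(text)) / chars_per_token)
}

# TRUE if the estimated prompt plus a reserve fits in the context window.
fits_context <- function(text, context_window, reserve = 500) {
  estimate_tokens(text) + reserve <= context_window
}

short_prompt <- strrep("word ", 1200)   # 6,000 characters, about 1,500 tokens
long_prompt  <- strrep("word ", 8000)   # 40,000 characters, about 10,000 tokens

short_fits <- fits_context(short_prompt, context_window = 4000)
long_fits  <- fits_context(long_prompt,  context_window = 4000)
```

With a 4,000-token window, the short prompt fits and the long one does not, which mirrors the difference between the demo paper and a typical article.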

5.5.4 Rate limiting

llm() makes one query per row of text. To guard against runaway costs from a coding mistake, Metacheck caps the number of queries (default 30). Change the cap with llm_max_calls():

llm_max_calls()      # check current limit
llm_max_calls(100)   # raise it
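Since each row becomes one query, counting rows before calling llm() tells you whether a search is within the cap. `check_query_count()` below is a hypothetical helper with the default of 30 hard-coded as an assumption. What Metacheck itself does when the cap is exceeded is not covered here; see ?llm.

```r
# Hypothetical pre-flight check: one query per row, compared against a cap.
check_query_count <- function(sentences, max_calls = 30) {
  n_queries <- nrow(sentences)
  if (n_queries > max_calls) {
    message(n_queries, " rows would need ", n_queries,
            " queries, above the cap of ", max_calls, ".")
  }
  n_queries <= max_calls
}

demo_sentences <- data.frame(text = paste("Sentence", 1:45))

too_many <- check_query_count(demo_sentences)            # 45 rows
ok       <- check_query_count(head(demo_sentences, 30))  # 30 rows
```

If the count is too high, narrowing the search further is usually preferable to raising the cap.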

See ?llm for the full set of arguments and details.