8 Power Analysis Check

8.1 What it checks

The power module uses regular expressions to find sentences that describe a statistical power analysis, and classifies each as a priori, sensitivity, or post-hoc. When LLM support is turned on, it additionally reads each power sentence and extracts structured details — the statistical test, planned sample size, alpha level, desired power, and effect size — and checks whether the analysis is fully reported.
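As a rough illustration of the regex step described above, here is a minimal base-R sketch. The patterns, the `classify_power()` helper, and the `"posthoc"` label are assumptions made up for this example; they are not the patterns or labels metacheck actually uses.

```r
# Hypothetical patterns for illustration only; not metacheck's own.
classify_power <- function(sentences) {
  # Keep only sentences that mention "power" as a whole word
  mentions_power <- grepl("\\bpower\\b", sentences,
                          ignore.case = TRUE, perl = TRUE)

  # Start from "unknown" and overwrite when a keyword is present
  type <- rep("unknown", length(sentences))
  type[grepl("a[ -]?priori|before (data )?collect", sentences,
             ignore.case = TRUE, perl = TRUE)] <- "apriori"
  type[grepl("sensitivity", sentences,
             ignore.case = TRUE, perl = TRUE)] <- "sensitivity"
  type[grepl("post[ -]?hoc|observed power", sentences,
             ignore.case = TRUE, perl = TRUE)] <- "posthoc"

  data.frame(text = sentences, power_type = type,
             stringsAsFactors = FALSE)[mentions_power, ]
}

sentences <- c(
  "An a priori power analysis indicated that 128 participants were needed.",
  "We conducted a sensitivity power analysis.",
  "Post-hoc power was 0.62.",
  "Participants were undergraduate students."
)
res <- classify_power(sentences)
res$power_type
```

Keyword matching like this is cheap and transparent, but it cannot tell a real analysis from a sentence that merely mentions power, which is why some matches end up as "unknown".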

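A priori power analyses, run before data collection to justify the sample size, are the most informative kind. Post-hoc (“observed”) power is rarely informative, because it is a direct function of the observed p-value and so adds nothing the test result does not already show.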
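To make the three kinds of analysis concrete, here is a minimal sketch using base R's `power.t.test()` for a two-group t-test. The effect sizes, alpha, and sample sizes are illustrative assumptions, not values produced by the module.

```r
# Illustrative values only (assumptions for this sketch).

# A priori: fix effect size, alpha, and power; solve for n per group.
apriori <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(apriori$n)

# Sensitivity: n is fixed; solve for the smallest detectable effect.
sens <- power.t.test(n = 50, sd = 1, sig.level = 0.05, power = 0.80,
                     alternative = "one.sided")
min_detectable_d <- sens$delta

# Post-hoc ("observed") power: plug an observed effect back in.
posthoc <- power.t.test(n = 50, delta = 0.2, sd = 1, sig.level = 0.05)
observed_power <- posthoc$power
```

Each call leaves exactly one of `n`, `delta`, and `power` unspecified, and that is the quantity `power.t.test()` solves for. Which one is left open is what distinguishes the three types.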

8.2 Running the module

Every module is run the same way, with module_run(paper, "module_name"). We use the built-in demopaper(), which contains a few power statements.

library(metacheck)

paper <- demopaper()  # built-in demo paper
module_run(paper, "power")

Power Analysis Check: We detected 3 potential power analyses.

Printing the result renders the module’s report. To work with the underlying data, use the elements of the returned list, such as traffic_light, summary_text, and table. The table has one row per detected power statement, with a power_type classification and a complete flag:

mo <- module_run(paper, "power")
mo$traffic_light

#> [1] "yellow"

mo$summary_text

#> [1] "We detected 3 potential power analyses."

mo$table[, c("text", "power_type", "complete")] |>
  knitr::kable()

text	power_type	complete
This paper demonstrates some good and poor practices for use with the {metacheck} R package. All data are simulated. The paper shows examples of (1) open and closed OSF links; (2a) citation of retracted papers, (2b) citations without a doi, (2c) citations with Pubpeer comments, (2d) citations in the FLoRA replication database, and (2e) missing/mismatched/incorrect citations and references; (3a) R files with code on GitHub that do not load libraries in one location, (3b) load files that are not shared in the repository, (3c) lack comments, and (3d) have absolute file paths; (4) imprecise reporting of non-significant p-values; (5) tests with and without effect sizes; (6) use of “marginally significant” to describe non-significant findings; (7) a power analysis reporting some of the essential attributes; and (8) retrieving information from preregistrations.	unknown	NA
We conducted a sensitivity power analysis 2 to determine that a Cohen’s d of 0.50 is the smallest effect size that we could detect with N = 50 participants in each group and 80% power.	sensitivity	NA
pwr::pwr.t.test(n = 50, power = 0.8, alternative = “greater”) is useful. We conclude the use of automated checks has potential to reduce the number of mistakes in scientific manuscripts.	unknown	NA

8.3 Running on many papers

Pass a paperlist such as the built-in psychsci (250 open-access Psychological Science articles) to run across a corpus. Here we run on the first 20 papers and summarise how many power statements were found per paper.

mo <- module_run(psychsci[1:20], "power")
head(mo$summary_table) |>
  knitr::kable()

paper_id	power_n	power_complete
0956797613520608	0	NA
0956797614522816	0	NA
0956797614527830	0	NA
0956797614557697	1	NA
0956797614560771	0	NA
0956797614566469	0	NA
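Once you have a summary table like this, ordinary data-frame operations answer questions such as "which papers mention power at all?". The sketch below rebuilds the six rows shown above by hand as a stand-in for `mo$summary_table`, so it runs without the package.

```r
# Stand-in for head(mo$summary_table), copied from the output above.
summary_table <- data.frame(
  paper_id = c("0956797613520608", "0956797614522816", "0956797614527830",
               "0956797614557697", "0956797614560771", "0956797614566469"),
  power_n = c(0, 0, 0, 1, 0, 0),
  power_complete = NA,
  stringsAsFactors = FALSE
)

# Papers with at least one detected power statement
has_power <- summary_table$power_n > 0
n_with_power <- sum(has_power)
papers_with_power <- summary_table$paper_id[has_power]

# Share of papers with any power statement
prop_with_power <- mean(has_power)
```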

8.4 A clean example and one with problems

A paper that reports an a priori power analysis with all the needed details is the “good” case; a paper that mentions power only vaguely, or reports post-hoc power, is the kind of thing the module surfaces for a closer look.

# demopaper contains one sensitivity analysis plus two looser matches
module_run(demopaper(), "power")

Power Analysis Check: We detected 3 potential power analyses.

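The power_type column tells you which statements are a priori (worth keeping) versus post-hoc (usually a red flag). The complete column is FALSE for analyses that are missing required information; in the regex-only output shown earlier it is NA, because completeness is only assessed when an LLM is enabled.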
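One way to turn these two columns into a review list is sketched below. The rows are hypothetical (not real module output), the `"posthoc"` label is assumed, and `triage()` is a helper invented for this example.

```r
# Hypothetical rows standing in for mo$table (not real module output).
tbl <- data.frame(
  power_type = c("apriori", "apriori", "posthoc", "unknown"),
  complete   = c(TRUE, FALSE, NA, NA),
  stringsAsFactors = FALSE
)

# Only a fully reported a priori analysis passes without review.
# `%in% TRUE` treats NA as "not complete" instead of propagating NA.
triage <- function(tbl) {
  fully_reported <- tbl$power_type == "apriori" & tbl$complete %in% TRUE
  ifelse(fully_reported, "ok", "review")
}

tbl$action <- triage(tbl)
tbl
```

Everything other than a complete a priori analysis is sent to "review" rather than rejected, in keeping with using the module as a prompt for human judgement.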

8.5 Options: using an LLM

The regular-expression behaviour above runs with no setup. To also extract structured details — the statistical test, sample size, alpha, power, and effect size — turn on LLM support. The module reads the same switch (llm_use()) and model (llm_model()) described in the LLM chapter; there is no special module argument. When an LLM is enabled, the result table gains columns such as statistical_test, sample_size, power, effect_size, and effect_size_metric.

Below we run exactly the same call three ways — with no LLM, with a local Ollama model, and with a cloud model on Groq — on demopaper(), and compare what each returns. (The LLM outputs shown here were captured from real runs; running the LLM at build time would require Ollama or an API key, so the results are loaded from saved objects.)

8.5.1 No LLM (default)

llm_use(FALSE)
module_run(demopaper(), "power")

This is the regex-only behaviour shown earlier: it detects and classifies power statements, but the extraction columns are empty.

8.5.2 Local LLM with Ollama

This run used qwen3.5:9b (a 9.7-billion-parameter model) running locally with Ollama. Nothing leaves your machine, but your own hardware has to run the model, so it is slower (this call took about 105 seconds).

llm_use(TRUE)
llm_model("ollama/qwen3.5:9b")
module_run(demopaper(), "power")

#> [1] "We detected 1 potential power analysis."

power_type	statistical_test	sample_size	power	effect_size	effect_size_metric	complete
sensitivity	NA	100	0.8	0.5	Cohen’s d	FALSE

The local model found the one genuine power analysis in the demo paper — the sensitivity analysis — and correctly extracted Cohen’s d = 0.5 and 80% power. It did not flag anything else.

8.5.3 Cloud LLM with Groq

This run used Groq’s default model (llama-3.1-8b-instant). It is much faster (about 11 seconds) but sends the text to an external service.

llm_use(TRUE)
llm_model("groq")
module_run(demopaper(), "power")

#> [1] "We detected 3 potential power analyses."

power_type	statistical_test	sample_size	power	effect_size	effect_size_metric	software	complete
apriori	NA	NA	NA	NA	NA	NA	FALSE
sensitivity	NA	50	0.8	0.5	Cohen’s d	NA	FALSE
apriori	t-test	50	0.8	NA	NA	R	FALSE

8.5.4 Discussing the difference

The two models behaved noticeably differently on the same input:

Recall vs. precision. The local qwen3.5:9b returned a single, clean row — the real sensitivity analysis. Groq’s smaller, faster llama-3.1-8b-instant returned three rows: it caught the same sensitivity analysis, but also flagged the paper’s abstract (a false positive, with every field NA) and a line of pwr.t.test() example code (extracting software = R and statistical_test = t-test).
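Extraction detail. Both correctly pulled Cohen’s d = 0.5 and power = 0.8 for the sensitivity analysis. They disagree on sample size: the local model reported 100 (the total across both groups), while Groq reported 50 (the per-group n stated in the sentence). Neither marked any analysis as complete, because the demo paper does not report every required field.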
Speed vs. privacy. Groq returned in ~11 s versus ~105 s locally, but the local run kept the manuscript on the machine. For unpublished work, the slower local model is usually the right trade-off.

There is no single “correct” answer here — the output depends on the model. A larger or more capable model generally gives cleaner extraction with fewer false positives, which is exactly why the model you choose matters. This is also why module output is meant to prompt human judgement rather than serve as a verdict: a person reading these tables quickly sees that the abstract and the code line are not real power analyses. See the Local AI with Ollama chapter for more on choosing and running local models.
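A quick way to see where two models agree is to compare their extracted fields side by side. The sketch below hand-copies the sensitivity row from each saved output above into a data frame; the variable names are made up for this example.

```r
# Sensitivity-analysis row from each run, copied from the tables above.
local_row <- data.frame(
  sample_size = 100, power = 0.8, effect_size = 0.5,
  effect_size_metric = "Cohen's d", stringsAsFactors = FALSE
)
groq_row <- data.frame(
  sample_size = 50, power = 0.8, effect_size = 0.5,
  effect_size_metric = "Cohen's d", stringsAsFactors = FALSE
)

# TRUE where both models extracted the same value for a field
fields <- c("sample_size", "power", "effect_size", "effect_size_metric")
agree <- vapply(fields,
                function(f) identical(local_row[[f]], groq_row[[f]]),
                logical(1))
agree
```

Fields where `agree` is FALSE are the ones worth checking against the original sentence by hand.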

8.5.5 Reproducibility

The power module accepts a seed argument (default 8675309) passed to the LLM, so repeated runs with the same model are more reproducible:

module_run(demopaper(), "power", seed = 1)

8.6 Validation

In a sample of 128 papers, 246 instances were coded: 203 power statements were correctly detected (true positives), 22 were missed (false negatives), and 21 detections were not power statements (false positives). Among all instances flagged as power statements, 90.6% were correct (positive predictive value, 203/224). Of the 225 actual power statements, 90.2% were detected (sensitivity, 203/225).
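The positive predictive value can be recomputed from the reported counts. The same counts also give the detection rate (sensitivity), which is a derived figure rather than one stated in the validation itself.

```r
# Counts reported in the validation sample
tp <- 203  # correctly detected power statements
fn <- 22   # missed power statements
fp <- 21   # detections that were not power statements

# Positive predictive value: share of flagged instances that were correct
ppv <- tp / (tp + fp)          # 203 / 224 = 0.90625

# Sensitivity: share of actual power statements that were detected
sensitivity <- tp / (tp + fn)  # 203 / 225

total_instances <- tp + fn + fp
```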

As with all modules, treat the output as a prompt for human judgement, not a verdict.