19  Effect Sizes

19.1 What it checks

The stat_effect_size module checks whether effect sizes are reported for t-tests and F-tests, and whether the reported effect size is coherent with the test statistic and degrees of freedom. Reporting a standardized effect size alongside a test is an APA Journal Article Reporting Standard, and it is what makes a result usable in a meta-analysis.

The module looks for the common ways effect sizes are written, then recomputes what the effect size should be from the test statistic and degrees of freedom. It checks Cohen’s d and g for t-tests, and partial η² and partial ω² for F-tests. It flags two things: tests that report no effect size, and reported effect sizes that do not match the value implied by the test.

This module is fully offline.

19.2 How the coherence check works (and its limits)

A test statistic does not always pin down a single effect size. For a t-test, the value of Cohen’s d you recover depends on the design: a paired t-test, an independent-samples t-test with equal group sizes, and one with unequal group sizes all imply different effect sizes from the same t and df. The reporting style rarely says which design was used, so the module computes d under several plausible assumptions — paired (dz), independent with equal n, and a range for independent with unequal n — and asks whether the reported value matches any of them. That is what the d_coherence value match_under_assumptions means.
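To make the design dependence concrete, the sketch below converts one t statistic into d under the three assumptions named above. It uses the standard textbook conversions (dz = t / sqrt(n) for a paired design, d = t * sqrt(1/n1 + 1/n2) for independent groups). The function name d_from_t and its n1 argument are illustrative only and are not part of the module, whose internal formulas may differ.

```r
# Sketch: Cohen's d implied by a t statistic under different design assumptions.
# Textbook conversions, not necessarily the module's internal code.
d_from_t <- function(t, df, n1 = NULL) {
  # Paired (or one-sample) design: n = df + 1 pairs, dz = t / sqrt(n)
  dz <- t / sqrt(df + 1)

  # Independent samples: total N = df + 2, d = t * sqrt(1/n1 + 1/n2)
  N <- df + 2
  d_equal <- t * sqrt(1 / (N / 2) + 1 / (N / 2))

  out <- c(paired_dz = dz, independent_equal_n = d_equal)

  # Optional unequal split: n1 is a hypothetical size for the first group
  if (!is.null(n1)) {
    n2 <- N - n1
    out["independent_unequal_n"] <- t * sqrt(1 / n1 + 1 / n2)
  }
  out
}

round(d_from_t(t = 2.5, df = 38, n1 = 10), 3)
```

For t(38) = 2.5 this gives roughly 0.40 (paired), 0.79 (equal n) and 0.91 (groups of 10 and 30), so the same test statistic is compatible with quite different values of d.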

This is the crucial caveat: a match under assumptions is not proof the effect size is correct. The check only establishes that the reported value is possible for some design consistent with the test statistic. If the authors ran a paired test but the reported d happens to coincide with the equal-n independent-samples value, the module will report a match even though the reported effect size is wrong for the design that was actually used. A match means “not obviously contradictory”; it does not mean “verified”.

The verdicts you will see are:

  • match_under_assumptions — the reported value is consistent with at least one of the designs the module tried. Possible, not proven.
  • no_match — the reported value does not match any tested design. This is the strong signal: the number is hard to reconcile with the test statistic under any assumption.
  • indeterminate — the module cannot recover an effect size to compare against (for example a Welch’s t-test with non-integer df, where the group sizes cannot be determined).

F-tests are a partial exception. When the numerator degrees of freedom is 1 (df1 = 1), partial η² is fixed by a single assumption-free formula, η²p = F / (F + df2), so a no_match there is an unambiguous arithmetic inconsistency rather than a design ambiguity.
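The formula can be wrapped in a small helper. The sketch below uses the general textbook identity η²p = (F × df1) / (F × df1 + df2), which reduces to F / (F + df2) when df1 = 1. The function name eta_p2_from_f is an assumption for illustration, and how the module itself handles df1 > 1 is not described here.

```r
# Sketch: partial eta-squared implied by an F statistic.
# Standard identity: eta_p^2 = (F * df1) / (F * df1 + df2)
eta_p2_from_f <- function(f, df1, df2) {
  (f * df1) / (f * df1 + df2)
}

# With df1 = 1 this is simply F / (F + df2)
eta_p2_from_f(f = 4, df1 = 1, df2 = 96)
#> [1] 0.04
```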

19.3 Running the module

paper <- demopaper()
module_run(paper, "stat_effect_size")

Effect Sizes in t-tests and F-tests: We found 1 t-test and/or F-test where effect sizes are not reported. Check these tests in the table below, and consider adding effect sizes

The table has one row per detected test. The columns most worth reading are the test text, any reported effect size, and the coherence verdict:

mo <- module_run(paper, "stat_effect_size")
mo$traffic_light
#> [1] "yellow"
mo$summary_text
#> [1] "We found 1 t-test and/or F-test where effect sizes are not reported. Check these tests in the table below, and consider adding effect sizes"
mo$table[, c("test", "test_text", "es", "d_coherence", "d_coherence_note")] |>
  knitr::kable()
| test   | test_text       | es       | d_coherence   | d_coherence_note |
|--------|-----------------|----------|---------------|------------------|
| t-test | t(97.7) = 2.9   | d = 0.59 | indeterminate | Non-integer df indicates Welch’s t-test (unequal variances); sample sizes cannot be determined. |
| t-test | t(97.2) = -1.96 | NA       | indeterminate | No parseable d effect size found. |

In the demo paper one t-test reports d = 0.59 while another reports no effect size at all, which is why the module returns a yellow light. The d_coherence_note column explains why a value could not be verified — for example a Welch’s t-test (non-integer df), where sample sizes cannot be recovered from the test alone.

19.4 Running on many papers

mo <- module_run(psychsci[1:20], "stat_effect_size")
head(mo$summary_table) |>
  knitr::kable()
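| paper_id         | ttests_with_es | ttests_without_es | Ftests_with_es | Ftests_without_es |
|------------------|----------------|-------------------|----------------|-------------------|
| 0956797613520608 | 0              | 0                 | 6              | 0                 |
| 0956797614522816 | 6              | 2                 | 28             | 0                 |
| 0956797614527830 | 0              | 0                 | 0              | 0                 |
| 0956797614557697 | 0              | 2                 | 5              | 0                 |
| 0956797614560771 | 4              | 0                 | 0              | 0                 |
| 0956797614566469 | 0              | 0                 | 0              | 0                 |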
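The per-paper counts can be rolled up into one overall figure. The sketch below rebuilds the six rows shown above as a plain data frame so that it runs without the package; on a real run the same steps would presumably apply to mo$summary_table, assuming it keeps these column names.

```r
# Rows copied from the summary table above; paper_id is kept as character
# so that the leading zeros survive.
summary_table <- data.frame(
  paper_id = c("0956797613520608", "0956797614522816", "0956797614527830",
               "0956797614557697", "0956797614560771", "0956797614566469"),
  ttests_with_es    = c(0, 6, 0, 0, 4, 0),
  ttests_without_es = c(0, 2, 0, 2, 0, 0),
  Ftests_with_es    = c(6, 28, 0, 5, 0, 0),
  Ftests_without_es = c(0, 0, 0, 0, 0, 0)
)

# Column totals across papers
count_cols <- setdiff(names(summary_table), "paper_id")
totals <- colSums(summary_table[count_cols])

n_tests   <- sum(totals)
n_missing <- totals[["ttests_without_es"]] + totals[["Ftests_without_es"]]

n_missing / n_tests  # share of detected tests lacking an effect size
```

In these six papers, 4 of the 53 detected tests (about 8%) lack an effect size, all of them t-tests.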

19.5 A real example of an incorrect effect size

The no_match verdict catches genuine mistakes. The Psychological Science article “Going All In: Unfavorable Sex Ratios Attenuate Choice Diversification” (DOI 10.1177/0956797616636631) reports, in a single sentence, two main effects:

Main effects of sex ratio emerged for reward responsiveness, F(1, 361) = 3.59, p = .029, η²p = .02 […] and mating-impression management, F(1, 361) = 5.89, p < .01, η²p = .03.

Both have one numerator degree of freedom, so the partial η² is fixed exactly by η²p = F / (F + df2) — no assumptions, no ambiguity:

3.59 / (3.59 + 361)   # implied for the reported eta-squared of .02
#> [1] 0.009846677
5.89 / (5.89 + 361)   # implied for the reported eta-squared of .03
#> [1] 0.01605386

The implied values are .010 and .016, but the paper reports .02 and .03 — each roughly double the correct value. The module flags both as no_match:

paper <- psychsci[["0956797616636631"]]
mo <- module_run(paper, "stat_effect_size")
mo$table |>
  dplyr::filter(eta_coherence == "no_match") |>
  dplyr::select(test_text, es, eta_implied_partial, eta_coherence) |>
  knitr::kable()
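| test_text        | es        | eta_implied_partial | eta_coherence |
|------------------|-----------|---------------------|---------------|
| F(1, 361) = 3.59 | ηp² = .02 | 0.00984667708933322 | no_match      |
| F(1, 361) = 5.89 | ηp² = .03 | 0.016053858104609   | no_match      |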

Because df1 = 1 leaves no room for an alternative design, this is not a “possible under some assumption” case — the reported numbers are simply inconsistent with the test statistics. This is exactly the kind of error the check exists to surface, and the kind a reader would otherwise have to recompute by hand.
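Rounding of the reported effect size cannot explain the gap either. The sketch below treats a value printed to two decimals as standing for any number that would round to it, and asks whether the implied value falls in that interval. The helper within_rounding and its tolerance rule are assumptions for illustration; the module's actual matching tolerance is not documented here, and the sketch ignores rounding in the reported F itself.

```r
# Sketch: could rounding alone explain a reported value?
# A value printed with `digits` decimals stands for the half-open interval
# [reported - 0.5 * 10^-digits, reported + 0.5 * 10^-digits).
within_rounding <- function(reported, implied, digits = 2) {
  half_unit <- 0.5 * 10^(-digits)
  implied >= reported - half_unit & implied < reported + half_unit
}

# Implied partial eta-squared for the two F(1, 361) tests
implied <- c(3.59, 5.89) / (c(3.59, 5.89) + 361)
within_rounding(reported = c(0.02, 0.03), implied = implied)
#> [1] FALSE FALSE
```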

19.6 Interpreting the result

The two substantive verdicts, no_match and match_under_assumptions, call for very different responses:

  • A no_match is worth checking carefully — as the example above shows, it can be a real inconsistency. But it can also be an artifact: an effect size reported to fewer than two decimals, a test statistic the module mis-parsed (for example an F value written with a thousands separator like 5,509.53), or an effect size in a style the module does not recognise. Read the original sentence before concluding there is an error.
  • A match_under_assumptions should not be read as a clean bill of health. As explained above, it only means the value is possible for some design — the reported effect size can still be wrong for the design the authors actually used. The module cannot confirm correctness; it can only rule out impossibility.

A missing effect size, by contrast, is unambiguous: it is usually worth adding.

19.7 Options

stat_effect_size takes only the paper argument.

19.8 Validation

In a sample of 161 papers containing 1469 tests, the module correctly detected 1106 reported effect sizes (true positives) and correctly identified 295 tests with no effect size (true negatives). It missed 23 effect sizes that were reported (false negatives) and flagged 45 effect sizes where none were reported (false positives). Of all the effect sizes the module detected, 96% were genuine (positive predictive value). In a validation against 221 reported Cohen’s d effect sizes, the coherence verdict was correct in 218 cases (99%). In a validation against 485 partial eta-squared effect sizes, it was correct in 480 cases (99%).
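The detection figures can be checked from the four counts. The sketch below recomputes the positive predictive value; sensitivity, specificity and accuracy are not stated above and are derived here from the same counts purely as an illustration.

```r
# Counts as reported in the validation paragraph above
tp <- 1106  # effect size reported and detected
tn <- 295   # no effect size reported, none detected
fn <- 23    # effect size reported but missed
fp <- 45    # effect size detected where none was reported

stopifnot(tp + tn + fn + fp == 1469)  # the four cells add up to all tests

metrics <- c(
  ppv         = tp / (tp + fp),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  accuracy    = (tp + tn) / (tp + tn + fn + fp)
)
round(metrics, 3)
```

This gives a positive predictive value of about 0.961, matching the 96% quoted above, with sensitivity about 0.980, specificity about 0.868 and overall accuracy about 0.954.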