
Most published statistical findings are built on two concepts that almost everyone confuses. The p-value tells you how surprised you should be by your data if the null hypothesis were true. Statistical power tells you whether your study had any chance of detecting the effect in the first place. Confuse them and you will believe things that are not there and miss things that are.
Simple Picture
You are in a dark room with a flashlight. The p-value tells you whether the thing you saw in the beam is surprising. Statistical power is how bright your flashlight is. A dim flashlight (low power) means you only see elephants — mice walk right past you undetected. A bright flashlight (high power) picks up subtler movements.
A low-powered study that fails to find an effect has told you almost nothing; it only means the effect was not elephant-sized. Absence of evidence becomes evidence of absence only when the flashlight was actually bright enough to catch what you were looking for.
P-Values: Surprise, Not Truth
The p-value answers a narrow question: if nothing were going on, how often would I see data this extreme? The conventional threshold is 0.05, meaning data at least this extreme would show up less than 5% of the time under the null hypothesis. Below 0.05, you call it “statistically significant.”
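That question is easy to misread in prose and hard to misread in code. Here is a minimal simulation of it (all numbers hypothetical): generate noise with no effect in it, and count how often noise alone looks at least as extreme as what you observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experiment: 30 measurements. Unknown to us, the null is true
# (the true mean really is zero), so anything we "see" is noise.
observed = rng.normal(loc=0.0, scale=1.0, size=30)
observed_mean = observed.mean()

# The p-value question, asked literally: across 100,000 null worlds,
# how often does pure noise produce a sample mean at least this extreme?
n_sims = 100_000
null_means = rng.normal(loc=0.0, scale=1.0, size=(n_sims, 30)).mean(axis=1)
p_value = np.mean(np.abs(null_means) >= abs(observed_mean))

print(f"observed mean: {observed_mean:+.3f}  simulated p-value: {p_value:.3f}")
```

Note what the number is: a frequency of noise outcomes, not a probability that your hypothesis is right.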
What the p-value does not tell you:
- The magnitude of the effect. A tiny, meaningless effect can be statistically significant with enough data. A drug that lowers blood pressure by 0.1 mmHg can hit p < 0.01 with a million participants (simulated just after this list). The effect is real and completely useless.
- The probability that the hypothesis is true. The p-value is P(data | null), not P(null | data). This inversion is the most common misread in all of applied statistics.
- Whether anyone should care. Significance is a property of the test, not a property of the world. Goodhart’s Law applies directly: once p < 0.05 becomes the target for publication, researchers optimize for it — through selective reporting, flexible analysis, and quiet data exclusion — and the metric decouples from the reality it was supposed to represent.
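To make the magnitude point concrete, here is a sketch of the blood-pressure example from the list above. Every parameter (the 0.1 mmHg effect, the 10 mmHg person-to-person spread, the sample size) is made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical trial: the drug lowers blood pressure by a clinically
# meaningless 0.1 mmHg, against person-to-person noise of ~10 mmHg.
n = 1_000_000
control = rng.normal(loc=120.0, scale=10.0, size=n)
treated = rng.normal(loc=119.9, scale=10.0, size=n)

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"estimated effect: {control.mean() - treated.mean():.3f} mmHg")
print(f"p-value: {p_value:.1e}")
```

With a million participants per arm, p lands far below 0.01. The significance is real; the usefulness is not.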
Power: Can You Even See It?
Statistical power is the probability of detecting an effect that actually exists. It is a function of three things: sample size, effect size, and the significance threshold.
Low power means you need the signal to be enormous before your study can pick it up. With a sample of 20, only the most dramatic effects will cross the significance threshold. Everything subtle — which is most of what matters in medicine, finance, and social science — vanishes into noise.
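Power is easiest to grasp by brute force: simulate many studies in which the effect is definitely real, and count how often the test notices it. A minimal sketch, with the effect sizes (in standard-deviation units) and the n = 20 sample chosen purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulated_power(effect, n, alpha=0.05, n_sims=10_000):
    """Fraction of simulated studies that detect a true effect of size `effect`."""
    detections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, size=n)
        treated = rng.normal(effect, 1.0, size=n)
        _, p = stats.ttest_ind(treated, control)
        detections += p < alpha
    return detections / n_sims

# Hypothetical effects: subtle, moderate, elephant-sized.
for effect in (0.2, 0.5, 1.2):
    print(f"effect = {effect} SD, n = 20 per arm: power ≈ {simulated_power(effect, 20):.0%}")
```

With 20 per arm, only the elephant is reliably detected; the subtle effect is caught well under one time in five.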
The structural problem: most studies are underpowered. Adequate sample sizes are expensive. Journals do not reward null results. So the literature is dominated by studies that either found elephant-sized effects (rare and often inflated) or got lucky with noise (common and not replicable). This is the engine behind the replication crisis — not fraud, but a systematic structural bias toward publishing flukes.
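The "rare and often inflated" claim can be checked directly. Simulate a field full of underpowered studies of the same modest effect, then look only at the ones the journals would accept. All parameters here are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical literature: 5,000 small studies of the same true effect.
true_effect, n = 0.2, 20
estimates, significant = [], []
for _ in range(5_000):
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(true_effect, 1.0, size=n)
    _, p = stats.ttest_ind(treated, control)
    estimates.append(treated.mean() - control.mean())
    significant.append(p < 0.05)

estimates = np.array(estimates)
significant = np.array(significant)

print(f"true effect:                     {true_effect:.2f}")
print(f"mean estimate, all studies:      {estimates.mean():.2f}")
print(f"mean estimate, 'published' only: {estimates[significant].mean():.2f}")
```

The significance filter alone inflates the apparent effect by roughly a factor of three. No one fabricated anything; the selection did it.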
Dimwit / Midwit / Better Take
The dimwit take is “p < 0.05 means it’s true, p > 0.05 means it’s false.”
The midwit take is “p-values are broken — we should switch to Bayesian methods and confidence intervals.” This is directionally correct but misses the deeper issue. Better statistical tools in the hands of the same incentive structures produce the same distortions. The problem is not the math.
The better take is that the entire pipeline from data to publication is a filter, and like all filters it manufactures patterns (Berkson’s Paradox in slow motion). Underpowered studies that happen to find significance are published. Underpowered studies that find nothing are filed away. The published literature is the admitted class — selected for surprise, not for truth. The fix is not better statistics but better incentives: pre-registration, mandatory power analysis, and journals that publish null results.
Main Payoff
The practical takeaway for anyone reading research — in finance, medicine, or social science — is a two-question filter:
- How surprised should I be? (the p-value)
- Could this study even have detected a realistic effect? (the power)
If the answer to #2 is “barely,” then a significant finding is more likely to be noise than signal, and a null finding is uninformative. Most of what passes for evidence in underpowered fields is just the flashlight being too dim to see the mice — and the elephants that show up in the beam are often shadows.
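That intuition follows from Bayes' rule, and it is worth making quantitative. The sketch below assumes, arbitrarily, that 10% of the hypotheses a field tests are actually true; the arithmetic is the standard positive-predictive-value calculation, not anything specific to one field:

```python
# Of all findings that cross p < 0.05, what fraction reflect a real effect?
def p_real_given_significant(power, prior=0.10, alpha=0.05):
    true_positives = power * prior          # real effects, detected
    false_positives = alpha * (1 - prior)   # null effects, lucky noise
    return true_positives / (true_positives + false_positives)

for power in (0.10, 0.50, 0.90):
    print(f"power = {power:.0%} -> P(real | significant) = "
          f"{p_real_given_significant(power):.0%}")
```

At 10% power, a significant result is real less than one time in five.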