
Statistical Power

The probability that a test will correctly detect a real effect when one exists: the complement of the Type II (false-negative) error rate.

Statistical power is the probability of finding a real winner when there is one. An 80%-powered test (the standard) has a 20% chance of missing a real effect. That means 1 in 5 genuine improvements get labeled "no significant difference" and thrown away.
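A quick way to see what "80% power" means in practice is to simulate it. The sketch below sizes a two-proportion test for 80% power at a two-sided α of 0.05, then repeatedly draws the test statistic under the true effect and counts how often the test actually reaches significance. The 10% baseline rate and 10% relative lift are assumed illustrative numbers, not figures from this glossary:

```python
import random
from statistics import NormalDist

norm = NormalDist()
random.seed(1)

alpha, target_power = 0.05, 0.80
p1, p2 = 0.10, 0.11                      # assumed: 10% baseline, +10% relative lift
z_alpha = norm.inv_cdf(1 - alpha / 2)
z_beta = norm.inv_cdf(target_power)

# Per-arm sample size from the two-proportion normal approximation
var_sum = p1 * (1 - p1) + p2 * (1 - p2)
n = (z_alpha + z_beta) ** 2 * var_sum / (p2 - p1) ** 2

# Standard error of the rate difference at that n, and the implied power
se = (var_sum / n) ** 0.5
analytic_power = 1 - norm.cdf(z_alpha - (p2 - p1) / se)

# Monte Carlo check: draw the z-statistic under the true effect and
# count how often the test crosses the significance threshold
trials = 100_000
hits = sum(
    abs(random.normalvariate((p2 - p1) / se, 1)) > z_alpha
    for _ in range(trials)
)
print(round(n), round(analytic_power, 3), round(hits / trials, 2))
```

With the sample sized exactly for 80% power, roughly one simulated test in five fails to reach significance even though the effect is real.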

Why 80% Power Is the Standard (and When It's Not Enough)

80% is a convention, not a law. It balances the cost of larger samples against the cost of missed opportunities. But for tests with large revenue implications, 90% power is worth the extra sample: it cuts the false-negative rate from 20% to 10%.
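Under the usual normal-approximation sample-size formula, the price of that upgrade can be estimated directly. A minimal sketch (the baseline rate and lift are hypothetical example values):

```python
from statistics import NormalDist

norm = NormalDist()
z_alpha = norm.inv_cdf(0.975)  # two-sided alpha = 0.05

def n_per_arm(p1, p2, power):
    """Two-proportion normal-approximation sample size per arm."""
    z_beta = norm.inv_cdf(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * var_sum / (p2 - p1) ** 2

# Assumed example: 10% baseline, +10% relative lift
n80 = n_per_arm(0.10, 0.11, 0.80)
n90 = n_per_arm(0.10, 0.11, 0.90)
print(round(n80), round(n90), round(n90 / n80, 2))
```

Moving from 80% to 90% power costs roughly a third more sample per arm, and the ratio is independent of the specific rates because it depends only on the z-scores.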

The Power-MDE Tradeoff

Power, sample size, and Minimum Detectable Effect are locked together: at a fixed sample size, you can have high power or a small MDE, but not both. If you want to detect smaller effects without sacrificing power, you need more sample. Most experimentation programs implicitly accept that they can only detect effects above 5-10% relative, meaning smaller but genuine improvements are systematically invisible.
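The tradeoff can be made concrete by inverting the sample-size formula: at fixed power, the smallest detectable lift shrinks only with the square root of traffic. A sketch under the usual normal approximation (the 5% baseline rate is an assumed example):

```python
from statistics import NormalDist

norm = NormalDist()
# Combined z-score for two-sided alpha = 0.05 and 80% power
z = norm.inv_cdf(0.975) + norm.inv_cdf(0.80)

def relative_mde(p, n):
    """Smallest detectable relative lift at n visitors per arm,
    approximating both arms' variance by the baseline rate p."""
    absolute = z * (2 * p * (1 - p) / n) ** 0.5
    return absolute / p

for n in (1_000, 10_000, 100_000):
    print(n, f"{relative_mde(0.05, n):.1%}")
```

Even 100,000 visitors per arm only buys a relative MDE of about 5-6% at a 5% baseline, which is why sub-5% improvements are invisible to most programs.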

How Underpowered Tests Destroy Experimentation Programs

The most insidious consequence of low power: it creates a false narrative that "nothing works." When teams run underpowered tests, most show inconclusive results — not because the ideas were bad, but because the tests couldn't detect the effects. This leads to testing fatigue and program shutdown.

Practical Advice

Calculate power before you start, not after. If you can't achieve 80% power for a meaningful MDE within a reasonable timeframe, reconsider the test design — can you test a higher-frequency metric? Can you combine it with another test? Or should you skip the statistical test and make a judgment call?
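One way to run that pre-test check is to translate the power calculation into calendar time. A hedged sketch, assuming a hypothetical program with a 3% baseline conversion rate and 2,000 visitors/day split evenly across two arms (`weeks_needed` is an illustrative helper, not a standard API):

```python
from statistics import NormalDist

norm = NormalDist()
z = norm.inv_cdf(0.975) + norm.inv_cdf(0.80)  # alpha = 0.05 two-sided, 80% power

def weeks_needed(baseline, relative_mde, daily_visitors, arms=2):
    """Estimated weeks of traffic to reach 80% power (normal approximation)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n_per_arm = z ** 2 * var_sum / (p2 - p1) ** 2
    return arms * n_per_arm / daily_visitors / 7

# Hypothetical program: 3% baseline conversion, 2,000 visitors/day
for mde in (0.05, 0.10, 0.20):
    print(f"{mde:.0%} MDE: {weeks_needed(0.03, mde, 2_000):.1f} weeks")
```

If the MDE you actually care about implies thirty weeks of traffic, that is the signal to redesign the test or make a judgment call rather than run it anyway.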