
Sample Size

How to calculate the right sample size for A/B tests, why underpowered tests waste resources, and the formulas practitioners actually use.

Sample size is the number of observations (visitors, users, transactions) you need in each variant of an experiment to detect a real effect with acceptable confidence. Getting this wrong — in either direction — is one of the most expensive mistakes in experimentation.

Why sample size calculations matter

An underpowered test (too few observations) will fail to detect real improvements, leading you to discard winning ideas. An overpowered test (too many observations) wastes traffic and time you could have spent on the next experiment. Both cost you money.

The right sample size depends on three inputs:

  1. Baseline conversion rate. Your current performance. A 2% conversion rate requires a much larger sample than a 20% rate to detect the same relative change.

  2. Minimum detectable effect (MDE). The smallest improvement you care about. If a 1% relative lift isn’t worth implementing, don’t power your test to detect it.

  3. Statistical power and significance level. Typically 80% power and 95% significance (alpha = 0.05). Power is the probability of detecting a real effect when one exists.

The math in practice

For a standard two-proportion z-test with 80% power and 95% significance:

  • Baseline 5%, MDE 10% relative (to 5.5%): ~31,000 per variant
  • Baseline 5%, MDE 20% relative (to 6.0%): ~8,200 per variant
  • Baseline 20%, MDE 10% relative (to 22.0%): ~6,500 per variant
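
These per-variant figures follow from the standard normal-approximation formula for comparing two proportions. A minimal sketch in Python, using only the standard library (`NormalDist` supplies the z-quantiles):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Observations per variant for a two-sided, two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # the rate you want to be able to detect
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # 1.96 for 95% significance
    z_beta = z.inv_cdf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for baseline, mde in [(0.05, 0.10), (0.05, 0.20), (0.20, 0.10)]:
    n = sample_size_per_variant(baseline, mde)
    print(f"baseline {baseline:.0%}, MDE {mde:.0%} relative: ~{n:,} per variant")
```

Online calculators use the same approximation and land within a few percent of these numbers; exact-test corrections change them only slightly.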

Notice the pattern: lower baseline rates and smaller MDEs require dramatically more traffic. This is why low-traffic sites struggle with experimentation — and why choosing the right MDE is a business decision, not just a statistical one.

Common mistakes

Running until you see significance. This is not a substitute for a sample size calculation. If you stop the moment p < 0.05, you systematically overestimate the effect size and inflate your false-positive rate well beyond the nominal 5%.
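
The inflation is easy to demonstrate with a simulation: run A/A tests (no real difference between arms), peek at a two-proportion z-test at regular intervals, and stop at the first p < 0.05. The peek interval and sample cap below are illustrative assumptions, but the conclusion is robust: the false-positive rate lands far above 5%.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, peek_every=200, max_n=2000, p=0.5, seed=0):
    """Fraction of A/A tests declared 'significant' when we stop at the first p < 0.05."""
    rng = random.Random(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(n_sims):
        a = b = 0  # conversion counts in each arm (identical true rate p)
        for n in range(1, max_n + 1):
            a += rng.random() < p
            b += rng.random() < p
            if n % peek_every == 0:  # peek at the running z-test
                pooled = (a + b) / (2 * n)
                se = sqrt(pooled * (1 - pooled) * 2 / n)
                if se > 0:
                    z = abs(a - b) / n / se
                    p_value = 2 * (1 - norm.cdf(z))
                    if p_value < 0.05:  # stop and (wrongly) declare a winner
                        false_positives += 1
                        break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

With ten peeks per test, the realized false-positive rate is typically three to four times the nominal level; more frequent peeking makes it worse.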

Ignoring variance in your metric. Revenue per visitor has much higher variance than binary conversion. Revenue-based tests typically need 3-5x more traffic than conversion-based tests.
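
To see the variance penalty concretely, compare per-variant sample sizes for a binary conversion metric and a continuous revenue-per-visitor metric using the difference-in-means approximation n ≈ 2σ²(z_α/2 + z_β)²/Δ². The revenue mean and standard deviation below are hypothetical illustration values, not benchmarks:

```python
from math import ceil
from statistics import NormalDist

def n_per_variant(variance, delta, power=0.80, alpha=0.05):
    """Per-variant n to detect a mean difference delta, given per-observation variance."""
    z = NormalDist()
    zsum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * variance * zsum ** 2 / delta ** 2)

# Binary conversion: 5% baseline, detect a 10% relative lift (0.5 points).
p = 0.05
n_conversion = n_per_variant(p * (1 - p), 0.10 * p)

# Revenue per visitor: hypothetical mean $2.50, std $20 (heavy-tailed, mostly zeros),
# detecting the same 10% relative lift ($0.25).
n_revenue = n_per_variant(20 ** 2, 0.10 * 2.50)

print(n_conversion, n_revenue, round(n_revenue / n_conversion, 1))
```

The multiplier is driven entirely by the metric's coefficient of variation, which is why trimming or winsorizing revenue outliers before analysis can meaningfully shorten tests.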

Not accounting for segments. If you plan to analyze results by device type or traffic source, you need sufficient sample size within each segment, not just overall.
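
The segment problem is mostly arithmetic: test duration is set by the smallest segment you intend to read, not by total traffic. A sketch, with hypothetical traffic shares:

```python
from math import ceil

needed_per_variant = 22_000   # from your sample size calculation
daily_visitors = 10_000       # hypothetical site-wide traffic
segment_share = {"desktop": 0.55, "mobile": 0.40, "tablet": 0.05}  # hypothetical mix

for segment, share in segment_share.items():
    # Two variants split the segment's traffic, so each variant
    # collects share * daily_visitors / 2 observations per day.
    days = ceil(needed_per_variant / (daily_visitors * share / 2))
    print(f"{segment}: {days} days for a readable per-segment result")
```

A test that is fully powered overall in nine days can need months to say anything trustworthy about a 5% segment; decide up front which segments you actually need to read.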

Practical example

Your e-commerce site converts at 3.5% with 10,000 daily visitors. You want to detect a 15% relative lift. At 80% power and 95% significance, you need roughly 21,000 visitors per variant, so a two-variant test needs about four days of traffic. To detect a 5% relative lift instead, you need roughly 177,000 per variant, or about 35 days. That calculation determines whether the test is worth running at all.
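
Running the example through the same normal-approximation formula (a sketch; exact figures vary slightly between calculators):

```python
from math import ceil
from statistics import NormalDist

def per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Per-variant n for a two-sided, two-proportion z-test."""
    p2 = baseline * (1 + relative_mde)
    z = NormalDist()
    zsum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    var = baseline * (1 - baseline) + p2 * (1 - p2)
    return ceil(zsum ** 2 * var / (p2 - baseline) ** 2)

daily = 10_000
for mde in (0.15, 0.05):
    n = per_variant(0.035, mde)
    days = 2 * n / daily  # two variants share the daily traffic
    print(f"MDE {mde:.0%}: ~{n:,} per variant, ~{days:.0f} days")
```

Tripling the sensitivity of the test (15% down to 5% MDE) costs roughly nine times the traffic, because required sample size scales with the inverse square of the effect size.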
