
Sample Size

How to calculate the right sample size for A/B tests, why underpowered tests waste resources, and the formulas practitioners actually use.

Sample size is the number of observations (visitors, users, transactions) you need in each variant of an experiment to detect a real effect with acceptable confidence. Getting this wrong — in either direction — is one of the most expensive mistakes in experimentation.

Why sample size calculations matter

An underpowered test (too few observations) will fail to detect real improvements, leading you to discard winning ideas. An overpowered test (too many observations) wastes traffic and time you could have spent on the next experiment. Both cost you money.

The right sample size depends on three inputs:

  1. Baseline conversion rate. Your current performance. A 2% conversion rate requires a much larger sample than a 20% rate to detect the same relative change.

  2. Minimum detectable effect (MDE). The smallest improvement you care about. If a 1% relative lift isn’t worth implementing, don’t power your test to detect it.

  3. Statistical power and significance level. Typically 80% power and 95% significance (alpha = 0.05). Power is the probability of detecting a real effect when one exists.

The math in practice

For a standard two-proportion z-test with 80% power and 95% significance:

  • Baseline 5%, MDE 10% relative (to 5.5%): ~31,000 per variant
  • Baseline 5%, MDE 20% relative (to 6.0%): ~8,200 per variant
  • Baseline 20%, MDE 10% relative (to 22.0%): ~6,500 per variant
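
These per-variant figures follow from the standard normal-approximation formula for comparing two proportions. A minimal sketch in Python, using only the standard library (`NormalDist` supplies the z-quantiles):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Observations per variant for a two-sided, two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # the rate you want to be able to detect
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # 1.96 for 95% significance
    z_beta = z.inv_cdf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for baseline, mde in [(0.05, 0.10), (0.05, 0.20), (0.20, 0.10)]:
    n = sample_size_per_variant(baseline, mde)
    print(f"baseline {baseline:.0%}, MDE {mde:.0%} relative: ~{n:,} per variant")
```

Online calculators use the same approximation and land within a few percent of these numbers; exact-test corrections change them only slightly.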

Notice the pattern: lower baseline rates and smaller MDEs require dramatically more traffic. This is why low-traffic sites struggle with experimentation — and why choosing the right MDE is a business decision, not just a statistical one.

Common mistakes

Running until you see significance. This is not a substitute for a sample size calculation. If you stop the moment p < 0.05, you systematically overestimate the effect size and inflate your false-positive rate well beyond the nominal 5%.
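
The inflation is easy to demonstrate with a simulation: run A/A tests (no real difference between arms), peek at a two-proportion z-test at regular intervals, and stop at the first p < 0.05. The peek interval and sample cap below are illustrative assumptions, but the conclusion is robust: the false-positive rate lands far above 5%.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, peek_every=200, max_n=2000, p=0.5, seed=0):
    """Fraction of A/A tests declared 'significant' when we stop at the first p < 0.05."""
    rng = random.Random(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(n_sims):
        a = b = 0  # conversion counts in each arm (identical true rate p)
        for n in range(1, max_n + 1):
            a += rng.random() < p
            b += rng.random() < p
            if n % peek_every == 0:  # peek at the running z-test
                pooled = (a + b) / (2 * n)
                se = sqrt(pooled * (1 - pooled) * 2 / n)
                if se > 0:
                    z = abs(a - b) / n / se
                    p_value = 2 * (1 - norm.cdf(z))
                    if p_value < 0.05:  # stop and (wrongly) declare a winner
                        false_positives += 1
                        break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

With ten peeks per test, the realized false-positive rate is typically three to four times the nominal level; more frequent peeking makes it worse.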

Ignoring variance in your metric. Revenue per visitor has much higher variance than binary conversion. Revenue-based tests typically need 3-5x more traffic than conversion-based tests.
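
To see the variance penalty concretely, compare per-variant sample sizes for a binary conversion metric and a continuous revenue-per-visitor metric using the difference-in-means approximation n ≈ 2σ²(z_α/2 + z_β)²/Δ². The revenue mean and standard deviation below are hypothetical illustration values, not benchmarks:

```python
from math import ceil
from statistics import NormalDist

def n_per_variant(variance, delta, power=0.80, alpha=0.05):
    """Per-variant n to detect a mean difference delta, given per-observation variance."""
    z = NormalDist()
    zsum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * variance * zsum ** 2 / delta ** 2)

# Binary conversion: 5% baseline, detect a 10% relative lift (0.5 points).
p = 0.05
n_conversion = n_per_variant(p * (1 - p), 0.10 * p)

# Revenue per visitor: hypothetical mean $2.50, std $20 (heavy-tailed, mostly zeros),
# detecting the same 10% relative lift ($0.25).
n_revenue = n_per_variant(20 ** 2, 0.10 * 2.50)

print(n_conversion, n_revenue, round(n_revenue / n_conversion, 1))
```

The multiplier is driven entirely by the metric's coefficient of variation, which is why trimming or winsorizing revenue outliers before analysis can meaningfully shorten tests.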

Not accounting for segments. If you plan to analyze results by device type or traffic source, you need sufficient sample size within each segment, not just overall.
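
The segment problem is mostly arithmetic: test duration is set by the smallest segment you intend to read, not by total traffic. A sketch, with hypothetical traffic shares:

```python
from math import ceil

needed_per_variant = 22_000   # from your sample size calculation
daily_visitors = 10_000       # hypothetical site-wide traffic
segment_share = {"desktop": 0.55, "mobile": 0.40, "tablet": 0.05}  # hypothetical mix

for segment, share in segment_share.items():
    # Two variants split the segment's traffic, so each variant
    # collects share * daily_visitors / 2 observations per day.
    days = ceil(needed_per_variant / (daily_visitors * share / 2))
    print(f"{segment}: {days} days for a readable per-segment result")
```

A test that is fully powered overall in nine days can need months to say anything trustworthy about a 5% segment; decide up front which segments you actually need to read.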

Practical example

Your e-commerce site converts at 3.5% with 10,000 daily visitors. You want to detect a 15% relative lift. At 80% power and 95% significance, you need roughly 21,000 visitors per variant, so a two-variant test needs about four days of traffic. To detect a 5% relative lift instead, you need roughly 177,000 per variant, or about 35 days. That calculation determines whether the test is worth running at all.
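
Running the example through the same normal-approximation formula (a sketch; exact figures vary slightly between calculators):

```python
from math import ceil
from statistics import NormalDist

def per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    """Per-variant n for a two-sided, two-proportion z-test."""
    p2 = baseline * (1 + relative_mde)
    z = NormalDist()
    zsum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    var = baseline * (1 - baseline) + p2 * (1 - p2)
    return ceil(zsum ** 2 * var / (p2 - baseline) ** 2)

daily = 10_000
for mde in (0.15, 0.05):
    n = per_variant(0.035, mde)
    days = 2 * n / daily  # two variants share the daily traffic
    print(f"MDE {mde:.0%}: ~{n:,} per variant, ~{days:.0f} days")
```

Tripling the sensitivity of the test (15% down to 5% MDE) costs roughly nine times the traffic, because required sample size scales with the inverse square of the effect size.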
