The most common question I get from teams launching their first A/B tests is "how long should we run it?" The most common answer they get from Google is "until you reach statistical significance." Both the question and the answer miss the point entirely.
Test duration is not a matter of patience. It is a matter of math. And the math depends on four variables that most teams never bother to calculate before they hit the start button.
I have seen teams call winners after three days on tests that needed three weeks. I have seen other teams run tests for months because nobody set a stopping rule. Both approaches waste resources and produce unreliable conclusions.
This guide covers how to calculate the right test duration before you launch, why peeking at results early invalidates your conclusions, and the regression-to-the-mean trap that fools even experienced analysts.
The Four Variables That Determine Test Duration
Every A/B test has a mathematically determined sample size. That sample size, combined with your daily traffic, determines how long the test needs to run. The four variables are:
1. Baseline Conversion Rate
This is your current conversion rate for the metric you are testing. If your checkout page converts at 3.2%, that is your baseline. You need this number from your analytics before you design the test.
The baseline matters because the math behaves differently at different conversion rates. Detecting a 10% relative improvement on a 50% conversion rate requires far fewer samples than detecting the same relative improvement on a 2% conversion rate.
2. Minimum Detectable Effect (MDE)
The MDE is the smallest improvement you care about detecting. If your checkout converts at 3.2% and you would not bother implementing a change that lifts it by less than 0.3 percentage points, your MDE is 0.3 percentage points (about a 9.4% relative lift).
Smaller MDEs require dramatically larger sample sizes. This is where most teams get the math wrong. They want to detect a 1% relative improvement with 5,000 visitors. The math does not care about your ambitions. Understanding the statistics behind these calculations is essential for setting realistic expectations.
3. Statistical Significance Level (Alpha)
Alpha is the probability of declaring a winner when there is no real difference — a false positive. The industry standard is 0.05, meaning you accept a 5% chance of a false positive. Lower alpha values (like 0.01) require more samples but give you more confidence in real results.
4. Statistical Power
Power is the probability of detecting a real difference when one exists. The standard is 0.80, meaning you have an 80% chance of finding a real effect. Higher power (like 0.90) catches more real effects but requires more samples.
The Sample Size Formula in Practice
You do not need to memorize the formula. Every reputable A/B testing tool includes a sample size calculator. But you do need to understand what it produces and why.
For a typical scenario — 3% baseline conversion, 10% relative MDE, 95% confidence, 80% power — you need approximately 53,000 visitors per variation. For a standard two-variation test, that is roughly 106,000 total visitors.
If your site gets 5,000 visitors per day to the tested page, your test needs to run for approximately 22 days. Not three days. Not "until we see green." Twenty-two days, minimum.
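The calculation itself fits in a few lines. The sketch below uses the standard two-sided, two-proportion normal-approximation formula; commercial calculators vary in the exact approximation they apply, so treat the output as a ballpark rather than a canonical figure:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion
    z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p2 - p1) ** 2)

per_arm = sample_size_per_arm(0.03, 0.10)    # visitors per variation
days = ceil(2 * per_arm / 5_000)             # duration at 5,000 visitors/day
```

Notice how unforgiving the formula is: halving the MDE roughly quadruples the required sample, because the effect size enters squared in the denominator.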
The Full-Week Rule
Always round up to complete weeks. User behavior varies by day of week — Monday shoppers behave differently than Saturday browsers. If your calculation comes out to 22 days, round up to 28 (four full weeks) to capture complete weekly cycles.
This matters more than most teams realize. I have seen tests show positive results during weekdays and negative results on weekends. If you stop the test on a Friday, you capture a biased sample that overweights weekday behavior.
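The rounding rule is trivial to encode, which makes it easy to bake into whatever planning spreadsheet or script your team already uses:

```python
from math import ceil

def full_week_duration(raw_days: int) -> int:
    """Round a raw test duration up to complete weeks
    so every day of the week is equally represented."""
    return ceil(raw_days / 7) * 7

full_week_duration(15)  # -> 21
full_week_duration(22)  # -> 28
```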
Why Peeking Destroys Your Test
Peeking is checking your test results before the predetermined sample size is reached. It is the single most common way teams invalidate their tests.
Here is why it is dangerous: statistical significance is not a stable property. It fluctuates. If you check a test 20 times before it reaches full sample size, you have roughly a one-in-four chance of seeing a "significant" result at some point — five times the error rate you thought you were accepting — even when there is no real difference between variants.
This is not a theoretical concern. This is how most organizations actually use A/B testing tools. They check the dashboard daily, see a p-value under 0.05 on day four, and call it a winner. They have just made a business decision based on statistical noise.
The Math Behind the Peeking Problem
Every time you check your test results, you are running a new statistical test. With a 5% significance level, each check has a 5% chance of producing a false positive. Check 20 times and your cumulative false positive rate explodes. This is the multiple comparisons problem, and it is well-documented in the statistical testing literature. The solution is simple: decide your sample size before you start, and do not look until you get there.
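You can watch the inflation happen in a quick Monte Carlo. The sketch below simulates A/A tests (no true difference) on a normal metric, peeking 20 times per test; the fraction of tests that ever cross the nominal 1.96 threshold lands far above the 5% you thought you were accepting. The peek schedule and run counts are arbitrary choices for illustration:

```python
import random
from math import sqrt

random.seed(7)

def ever_significant(n_peeks=20, per_peek=100, z_crit=1.96):
    """One A/A test: accumulate per-visitor differences (true effect = 0),
    peek after every `per_peek` observations, and report whether the
    z-statistic ever crosses the nominal threshold."""
    total, count = 0.0, 0
    for _ in range(n_peeks):
        for _ in range(per_peek):
            total += random.gauss(0, 1)
            count += 1
        if abs(total / sqrt(count)) > z_crit:
            return True
    return False

runs = 1_000
rate = sum(ever_significant() for _ in range(runs)) / runs
# `rate` is the cumulative false-positive rate under 20 peeks --
# several times the nominal 5%.
```

Because successive peeks reuse the same accumulating data, the checks are correlated, so the inflation is less extreme than 20 independent tests would suggest — but it is still severe.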
Sequential Testing as an Alternative
If you genuinely need to monitor results during the test — for safety reasons, for instance — use sequential testing methods. These adjust the significance threshold at each peek to maintain the overall false positive rate. Bayesian approaches also handle continuous monitoring more naturally than frequentist methods.
But sequential testing is not a free lunch. It typically requires larger sample sizes than a fixed-horizon test. You are trading speed for the ability to monitor.
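To make the threshold-adjustment idea concrete, here is the crudest possible correction — splitting the error budget evenly across peeks (a Bonferroni split). Real group-sequential designs like Pocock or O'Brien-Fleming boundaries, and alpha-spending functions, are less conservative, but the mechanism is the same:

```python
from statistics import NormalDist

def per_peek_alpha(overall_alpha=0.05, n_peeks=20):
    """Bonferroni split: divide the overall false-positive budget
    evenly across the planned number of peeks."""
    return overall_alpha / n_peeks

alpha_k = per_peek_alpha()                    # 0.0025 per peek
z_k = NormalDist().inv_cdf(1 - alpha_k / 2)   # ~3.02 instead of 1.96
```

Each individual peek now needs a much more extreme result before you are allowed to stop — which is exactly why sequential designs demand larger worst-case sample sizes.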
The Regression to the Mean Trap
Regression to the mean is the phenomenon where extreme results tend to move toward the average over time. In A/B testing, it manifests as early results that look dramatic but fade as more data accumulates.
Here is a concrete example. You launch a test on Monday. By Wednesday, the variant shows a 25% lift with p < 0.01. You are excited. By Friday, the lift has dropped to 12%. By the end of week two, it is 4%. What happened?
Nothing happened. The early result was driven by a small, unrepresentative sample. As the sample grew, it regressed toward the true effect size — which might be 4%, or might be zero. The "decline" was not the variant losing effectiveness. It was the statistics stabilizing.
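A small simulation makes the stabilization visible. The sketch below plants a true 4% relative lift, then compares the scatter of lift estimates taken at a "day three" sample against estimates taken at the full calculated sample; the baseline rate, sample sizes, and run counts are illustrative assumptions:

```python
import random
from statistics import pstdev

random.seed(42)

def observed_lift(n, p_control=0.03, true_relative_lift=0.04):
    """Observed relative lift after n visitors per arm,
    with a true relative lift of 4%."""
    p_variant = p_control * (1 + true_relative_lift)
    rate_c = sum(random.random() < p_control for _ in range(n)) / n
    rate_v = sum(random.random() < p_variant for _ in range(n)) / n
    return (rate_v - rate_c) / rate_c

early = [observed_lift(1_000) for _ in range(200)]   # "day three" snapshots
late = [observed_lift(40_000) for _ in range(50)]    # full-sample snapshots
# Early snapshots scatter wildly around the true 4% lift; full-sample
# snapshots cluster far more tightly around it.
```

Dramatic early lifts like the 25% in the example above are entirely consistent with a true effect of 4% — or zero — at small samples.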
How to Protect Against It
The only protection against regression to the mean is adequate sample size. If your test requires 40,000 visitors per variation and you only have 2,000, your results are unreliable regardless of what the p-value says.
Do not trust early results. Do not get excited by day-three lifts. Do not get discouraged by day-three drops. Wait for the sample size you calculated before you draw any conclusions.
The Magic Number Myth
There is no universal minimum for how long an A/B test should run. Anyone who tells you "always run tests for two weeks" or "you need at least 1,000 conversions" is giving you a heuristic that may or may not apply to your situation.
The minimum test duration depends entirely on your specific traffic, conversion rate, and the effect size you need to detect. A high-traffic e-commerce site testing a major checkout change might reach its required sample size in four days. A B2B SaaS site testing a pricing page headline might need eight weeks.
Calculate it. Every time. Before you launch.
Running Tests on Low-Traffic Pages
If your sample size calculation tells you a test will take six months, you have three options:
Increase the MDE. Accept that you can only detect larger effects. Instead of looking for a 5% relative lift, look for a 20% lift. This requires bolder changes — which is often better testing strategy anyway.
Aggregate traffic. If you are running multiple tests across different pages, consider whether you can combine similar pages into a single test. Test a CTA button change across all product pages rather than one at a time.
Use variance reduction techniques. Methods like CUPED (Controlled-experiment Using Pre-Experiment Data) can reduce the variance in your metric, which effectively reduces the sample size needed. This is an advanced technique but can cut required sample sizes by 50% or more.
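A minimal sketch of the CUPED adjustment on simulated data may help: each user's in-experiment metric is adjusted using their pre-experiment value of the same (or a correlated) metric. The 0.8 pre/post relationship and the metric scales below are made-up assumptions for illustration:

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Simulated users whose pre-experiment behavior predicts their
# in-experiment behavior (e.g., last month's spend vs. this month's).
pre = [random.gauss(100, 20) for _ in range(5_000)]
post = [0.8 * x + random.gauss(0, 10) for x in pre]

# CUPED: theta = cov(pre, post) / var(pre), then subtract the
# predictable component of each user's metric.
mx = mean(pre)
theta = (sum((x - mx) * y for x, y in zip(pre, post))
         / sum((x - mx) ** 2 for x in pre))
adjusted = [y - theta * (x - mx) for x, y in zip(pre, post)]

reduction = 1 - pvariance(adjusted) / pvariance(post)
# `reduction` is the fractional variance cut; required sample size
# shrinks by roughly the same factor.
```

The adjustment leaves the treatment effect unbiased (pre-experiment data cannot depend on which arm a user later lands in) while stripping out predictable between-user noise.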
External Factors That Affect Duration
Your sample size calculation assumes a stable environment. In practice, several external factors can extend your required test duration.
Seasonality. Testing during Black Friday is not the same as testing in February. If your test spans a major event, the behavior shift can contaminate your results. Consider whether external validity threats might affect your conclusions.
Marketing campaigns. A new ad campaign can change the composition of your traffic mid-test. If you were testing on organic visitors and suddenly get a flood of paid traffic, your sample is no longer homogeneous.
Product changes. Any change to the product experience outside of your test can act as a confounding variable. This is why test coordination — knowing what else is shipping during your test window — is critical.
Setting Stopping Rules
Before every test, document three things:
- The required sample size — calculated from your four variables
- The expected end date — based on sample size divided by daily traffic
- The early stopping criteria — under what conditions you will stop the test early (only for safety, like a major conversion rate drop)
Share these with your team. Put them in your test brief. When someone asks "is the test done yet?" the answer is either "we have not reached sample size" or "yes, we reached sample size on Tuesday."
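One lightweight way to make the brief unambiguous is a small structured record that computes the expected end date from the other two numbers. Everything here — the class name, field names, and example values — is hypothetical scaffolding, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from math import ceil

@dataclass(frozen=True)
class TestBrief:
    """Hypothetical test-brief record; field names are illustrative."""
    required_sample_per_arm: int
    daily_traffic: int
    start: date
    early_stop_rule: str

    @property
    def expected_end(self) -> date:
        """End date implied by total sample size and daily traffic."""
        days = ceil(2 * self.required_sample_per_arm / self.daily_traffic)
        return self.start + timedelta(days=days)

brief = TestBrief(
    required_sample_per_arm=40_000,
    daily_traffic=5_000,
    start=date(2026, 3, 2),
    early_stop_rule="stop only if conversion drops more than 20% vs. control",
)
# brief.expected_end -> 16 days after start
```

Freezing the record (`frozen=True`) is a small nudge toward the discipline the brief exists to enforce: the numbers are decided once, before launch.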
This removes the temptation to peek and the political pressure to call winners early.
Pro Tip: Build Duration Into Your Prioritization
Test duration should factor into your prioritization framework. A test that requires eight weeks of runtime has a higher opportunity cost than one that requires two weeks. If two test ideas have similar expected impact, run the faster one first.
This seems obvious, but most prioritization frameworks ignore test duration entirely. They rank by expected impact and ease of implementation, forgetting that a brilliant test idea is worthless if your traffic cannot support it. Factor duration into how you prioritize your testing roadmap.
What to Learn Next
This article covers the math and psychology of test duration. Here is where to go deeper:
- A/B Testing Statistics — understand the statistical foundations behind sample size calculations
- Running Multiple Tests Simultaneously — how to manage duration when you have parallel experiments
- Bayesian vs. Frequentist Testing — an alternative approach that handles continuous monitoring differently
- CUPED and Variance Reduction — advanced techniques for reducing required sample sizes
- External Validity Threats — factors that can make your test results unreliable regardless of duration