Pre-Test and Post-Test Calculators: Statistical Guardrails and the Cost of Statistical Debt
TL;DR: Running experiments without proper pre-test and post-test statistical tools is the experimentation equivalent of shipping code without tests. The debt accumulates silently, and it comes due when a shipped false positive costs the team credibility it spent years building.
Key Takeaways
- Pre-test calculators determine required sample size and runtime; skipping them produces underpowered tests that systematically miss real effects
- Post-test calculators validate whether observed effects are statistically sound; skipping them produces shipped false positives that compound into credibility damage
- Statistical Debt is the accumulated cost of tests run without proper pre/post calculators — it behaves like technical debt, compounding silently until a visible failure forces a reckoning
- Minimum Detectable Effect (MDE) is the most misunderstood parameter: too small and you can never run enough tests; too large and you systematically miss meaningful wins
- Guardrail metrics and pre-declared success criteria are the single most effective defense against post-hoc interpretation bias
Underpowered Tests Are Invisible Failures
The most common failure mode in experimentation isn't shipped false positives. It's accepted false negatives — real effects that went undetected because the test was underpowered.
This is the experimentation version of the planning fallacy. Teams are systematically optimistic about what their sample size can detect. "We got 10,000 visitors, that's plenty" fails to account for what the Minimum Detectable Effect actually is at that sample size. If your MDE is 5% but the real effect is 2%, you'll run the test, see noise, conclude "no effect," and kill a variant that was actually winning.
The downstream cost is larger than any single test. It's a systematic bias toward believing nothing works — because you keep missing the things that do work. This creates a learned helplessness in experimentation programs: teams stop proposing bold hypotheses because "tests never show anything."
Pre-test calculators solve this at the source. By forcing explicit calculation of required sample size before launch, they make underpowered tests visible before they're run.
"Teams don't skip sample size math because they're lazy. They skip it because nobody taught them, and the tool didn't either." — Atticus Li
The Statistical Debt Framework
Every test you run without proper pre-test and post-test discipline adds to Statistical Debt. Like technical debt, it accumulates silently and compounds.
Statistical Debt = (Tests run without pre-test power analysis) + (Tests stopped without post-test validation) + (Tests shipped on post-hoc metrics)
These three categories each represent a specific failure mode:
Without pre-test power analysis: You don't know if the sample was large enough to detect the effect you care about. Your null results are uninterpretable.
Without post-test validation: You don't know if the result you observed is robust. Maybe it's SRM-affected. Maybe it's a day-of-week artifact. You shipped without checking.
On post-hoc metrics: You tracked 12 metrics, one of them looked significant, and you shipped on that one. Multiple-comparisons math says this happens by luck alone far more often than teams expect: with 12 independent metrics each tested at α = 0.05, the chance of at least one false positive is roughly 1 − 0.95^12 ≈ 46%.
Interpretation thresholds:
- Debt near zero — Every test has pre-declared MDE, sample size, and success criteria. Every result gets SRM and significance checks before shipping.
- Debt moderate — Pre-test discipline is partial. Post-test validation happens for high-stakes tests only.
- Debt high — Most tests run without power analysis. Ship decisions happen on the first significant-looking metric.
- Debt critical — The team no longer trusts its own results, and every proposal faces executive skepticism.
The framework is useful because debt is invisible until it isn't. One shipped false positive with visible consequences (a feature that tanks retention, a pricing change that drops revenue) makes all the silently-accumulated debt suddenly visible.
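As an illustration of how the formula can be operationalized, here is a minimal sketch that tallies debt-accruing events across a test history. The record fields and the idea of scoring one point per missed discipline step are assumptions for demonstration, not part of the framework itself.

```python
from dataclasses import dataclass

@dataclass
class TestRecord:
    had_power_analysis: bool         # pre-declared MDE, sample size, success criteria
    had_post_test_validation: bool   # SRM + significance checks before shipping
    shipped_on_posthoc_metric: bool  # ship decision made on a metric not pre-declared

def statistical_debt(tests: list[TestRecord]) -> int:
    """Count debt-accruing events across a program's test history."""
    return sum(
        (not t.had_power_analysis)
        + (not t.had_post_test_validation)
        + t.shipped_on_posthoc_metric
        for t in tests
    )

history = [
    TestRecord(True, True, False),    # fully disciplined test: adds nothing
    TestRecord(False, True, False),   # no power analysis: adds 1
    TestRecord(False, False, True),   # no discipline at all: adds 3
]
print(statistical_debt(history))  # 4
```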
What Pre-Test Calculators Actually Do
A pre-test calculator takes your baseline metric, desired MDE, statistical significance threshold, and power target — and tells you how many users per variant you need and for how long.
Key inputs:
- Baseline conversion rate or metric value — What the current state produces. Usually a 4-week rolling average.
- Minimum Detectable Effect (MDE) — The smallest relative change you care about. A 2% MDE says "if the real effect is smaller than 2%, I don't need to detect it."
- Significance level (α) — Usually 0.05. The probability of a false positive you're willing to accept.
- Statistical power (1-β) — Usually 0.8. The probability of detecting a real effect if it exists.
Output: Required sample size per variant and estimated runtime given your current traffic.
The trap most teams fall into is setting MDE too low. "We want to detect a 1% lift" sounds reasonable until the calculator tells you that requires 80,000 users per variant and 6 weeks of runtime. At that point teams either accept the runtime, widen the MDE, or (the failure mode) ignore the calculator and run the test anyway at insufficient sample size.
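For readers who want to see the math a pre-test calculator is doing, here is a minimal sketch using the standard normal-approximation formula for a two-sided, two-proportion test. The function names and the example figures (5% baseline, 10% relative MDE, 20,000 eligible visitors per week) are illustrative assumptions, not numbers from this article.

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)    # smallest lift worth detecting
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided significance threshold
    z_beta = norm.ppf(power)                   # power target
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance_sum) / (p2 - p1) ** 2
    return int(round(n))

def runtime_weeks(n_per_variant, weekly_traffic, n_variants=2):
    """Estimated runtime given total eligible traffic per week."""
    return n_per_variant * n_variants / weekly_traffic

n = sample_size_per_variant(baseline_rate=0.05, relative_mde=0.10)
print(n, "users per variant")
print(round(runtime_weeks(n, weekly_traffic=20_000), 1), "weeks")
```

The specific numbers matter less than the sequence: sample size and runtime are computed before launch, not argued about after.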
Choosing MDE Correctly
MDE is the most misunderstood parameter in experimentation. Two failure modes dominate:
MDE too small. Every test requires an impractical sample size. Teams either never launch tests or launch them at insufficient power.
MDE too large. Every test has achievable sample size, but real effects smaller than the MDE are systematically invisible. You miss a lot of real wins.
The correct MDE is the smallest effect that would change a business decision. If a 1.5% lift in conversion would materially change your roadmap, MDE should be ≤ 1.5%. If anything below 5% wouldn't change what you do, MDE can be 5% and you save sample size.
Most teams set MDE by guessing. A disciplined team derives MDE from: current conversion rate × expected traffic × revenue per conversion × required ROI for the change to be worth shipping.
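One way to operationalize that derivation, sketched with made-up numbers: find the smallest relative lift whose incremental value clears the bar you require for the change to be worth shipping. The function name and every figure below are illustrative assumptions.

```python
def smallest_lift_worth_shipping(baseline_rate, monthly_traffic,
                                 revenue_per_conversion, required_monthly_value):
    """Relative lift whose incremental revenue just clears the shipping bar."""
    baseline_revenue = baseline_rate * monthly_traffic * revenue_per_conversion
    return required_monthly_value / baseline_revenue

# Illustrative numbers only
mde = smallest_lift_worth_shipping(
    baseline_rate=0.04,
    monthly_traffic=200_000,
    revenue_per_conversion=60,
    required_monthly_value=10_000,  # minimum value for the change to be worth it
)
print(f"MDE ≈ {mde:.1%}")  # ≈ 2.1% relative lift
```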
What Post-Test Calculators Actually Do
Post-test calculators validate whether observed results are real. Three checks matter:
Significance check. Given the observed difference and sample size, is the effect statistically significant at your pre-declared threshold? Standard frequentist significance testing.
SRM check. Did the actual traffic split match the designed split? A chi-squared test catches splits like 49.5/50.5 that look fine to the eye but, at realistic sample sizes, can signal upstream assignment or logging problems.
Confidence interval check. What's the range of plausible true effects given the observed data? A significant result with a confidence interval of [0.1%, 8%] is very different from [3%, 5%] — same significance, different confidence about magnitude.
The confidence interval check is the one teams skip most often, and it's often where the real story lives. A result at the edge of significance with a wide confidence interval is not a clear win — it's a "consider running a longer test" signal.
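A minimal sketch of the three checks, assuming a simple two-variant test with a 50/50 designed split; the function name and the example counts are illustrative, not from a real experiment.

```python
import numpy as np
from scipy.stats import chisquare, norm

def post_test_checks(control_n, control_conv, variant_n, variant_conv,
                     alpha=0.05, designed_split=(0.5, 0.5)):
    """Significance, SRM, and confidence-interval checks for a two-variant test."""
    # 1. SRM check: did the observed split match the designed split?
    total = control_n + variant_n
    expected = [total * designed_split[0], total * designed_split[1]]
    srm_p = chisquare([control_n, variant_n], f_exp=expected).pvalue

    # 2. Significance check: two-proportion z-test on the observed rates
    p_c, p_v = control_conv / control_n, variant_conv / variant_n
    pooled = (control_conv + variant_conv) / total
    se_pooled = np.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    z = (p_v - p_c) / se_pooled
    sig_p = 2 * (1 - norm.cdf(abs(z)))

    # 3. Confidence interval for the absolute difference in rates
    se_diff = np.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    z_crit = norm.ppf(1 - alpha / 2)
    ci = (p_v - p_c - z_crit * se_diff, p_v - p_c + z_crit * se_diff)

    return {"srm_p": srm_p, "significance_p": sig_p, "diff_ci": ci}

print(post_test_checks(control_n=50_200, control_conv=2510,
                       variant_n=49_800, variant_conv=2630))
```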
Guardrail Metrics: The Underrated Defense
Beyond the primary metric, every test should have guardrail metrics with pre-declared fail-stop thresholds. Common guardrails:
- Page load time (don't ship a feature that slows the site)
- Error rate (don't ship a variant that breaks)
- Retention metrics (don't ship a primary-metric winner that costs long-term users)
- Revenue per visitor (don't ship a conversion-rate winner that sells lower-value bundles)
The point of guardrails isn't to pass/fail tests on them. It's to prevent shipping primary-metric winners that create secondary-metric disasters. A conversion lift that comes with a 20% retention loss is not a win.
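A sketch of how pre-declared guardrails can be encoded so the fail-stop thresholds exist before the results do. The metric names and thresholds below are illustrative assumptions, not recommended values.

```python
# Pre-declared guardrails: metric name -> (direction, fail-stop threshold)
GUARDRAILS = {
    "page_load_ms":        ("max_relative_increase", 0.05),  # at most +5% load time
    "error_rate":          ("max_relative_increase", 0.00),  # no increase allowed
    "d7_retention":        ("max_relative_decrease", 0.02),  # at most -2% retention
    "revenue_per_visitor": ("max_relative_decrease", 0.01),  # at most -1% RPV
}

def guardrail_violations(control: dict, variant: dict) -> list[str]:
    """Return guardrails whose pre-declared fail-stop thresholds were breached."""
    breaches = []
    for metric, (direction, limit) in GUARDRAILS.items():
        change = (variant[metric] - control[metric]) / control[metric]
        if direction == "max_relative_increase" and change > limit:
            breaches.append(metric)
        if direction == "max_relative_decrease" and change < -limit:
            breaches.append(metric)
    return breaches

control = {"page_load_ms": 1200, "error_rate": 0.004,
           "d7_retention": 0.31, "revenue_per_visitor": 2.45}
variant = {"page_load_ms": 1210, "error_rate": 0.004,
           "d7_retention": 0.29, "revenue_per_visitor": 2.51}
print(guardrail_violations(control, variant))  # ['d7_retention']: breach blocks shipping
```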
Common Mistakes in Statistical Guardrails
Peeking. Checking results multiple times before the test is done and stopping "when it looks good." This inflates false positive rates dramatically — a 5% significance threshold checked daily for a week has an effective false positive rate closer to 20% (see the simulation sketch after this list).
Ignoring the confidence interval. Reporting "2.3% lift, p = 0.04" without the confidence interval hides whether the true effect is probably 0.5% or 5%.
Chasing secondary metrics. When the primary metric is inconclusive and a secondary metric moves, the temptation to reframe the secondary as the real story is strong. Resist it.
Setting MDE based on what sample size allows. Calculating "we have 10,000 users, so MDE must be 5%" is backwards. MDE should be set by business relevance, then sample size or runtime adjusted to deliver it.
Skipping A/A tests. An occasional A/A test (where both variants are identical) validates that your experimentation infrastructure produces the null result it should. Most teams skip these and discover infrastructure problems only when they've shipped false positives.
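To make the peeking inflation concrete, here is a small simulation sketch: thousands of no-effect (A/A-style) tests, each checked daily for a week and stopped the moment a naive significance test crosses the threshold. The traffic figures are arbitrary assumptions; the qualitative result, a false positive rate well above the nominal 5%, is the point.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def peeking_false_positive_rate(n_sims=2_000, days=7, users_per_day=2_000,
                                true_rate=0.05, alpha=0.05):
    """Share of no-effect tests declared 'significant' at any daily peek."""
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            conv_a += rng.binomial(users_per_day, true_rate)
            conv_b += rng.binomial(users_per_day, true_rate)  # same rate: no real effect
            n_a += users_per_day
            n_b += users_per_day
            p_a, p_b = conv_a / n_a, conv_b / n_b
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
            z = (p_b - p_a) / se
            if 2 * (1 - norm.cdf(abs(z))) < alpha:
                false_positives += 1
                break  # team stops 'when it looks good'
    return false_positives / n_sims

print(peeking_false_positive_rate())  # well above the nominal 5%
```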
Advanced: Bayesian Sequential Testing
For teams running tests at high volume, frequentist fixed-sample methods create a velocity problem. You wait for the full sample even when the signal is obvious.
Bayesian sequential testing allows continuous evaluation with proper probability interpretation. You can stop when the posterior probability of a win exceeds your threshold, without the peeking problem. The tradeoffs: you need to specify priors (which is its own discipline), and tooling is less universal than frequentist methods.
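A minimal sketch of that stopping rule, using a Beta-Binomial model with a flat prior and Monte Carlo draws from the posterior; the counts and the 0.95 threshold are illustrative assumptions, not a recommended configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b,
                               prior_alpha=1, prior_beta=1, draws=100_000):
    """Posterior P(variant rate > control rate) under Beta-Binomial with a flat prior."""
    post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, draws)
    post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, draws)
    return (post_b > post_a).mean()

# Evaluate at any point; stop when the probability crosses a pre-declared threshold
p_win = prob_variant_beats_control(conv_a=480, n_a=10_000, conv_b=545, n_b=10_000)
print(f"P(variant > control) = {p_win:.3f}")  # compare against e.g. a 0.95 stopping rule
```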
Most teams don't need Bayesian methods until they're running 40+ tests per quarter. Below that volume, frequentist discipline plus occasional A/A validation is enough.
Frequently Asked Questions
What's the minimum statistical discipline to adopt first?
Pre-test power analysis for every launched test. If you do nothing else, do this. It prevents the silent accumulation of underpowered tests that teach you nothing.
How long should a test run?
Long enough to hit the sample size the pre-test calculator specified, and never less than one full week to capture day-of-week effects. For features with habitual use, 2-4 weeks to see novelty effects stabilize.
Can I trust a significant result at the minimum sample size?
The statistical math is correct at minimum sample size — you hit the power level you designed. But confidence intervals will be wider than if you had run longer. If the confidence interval is concerning, consider running longer before shipping.
What do I do with an inconclusive result?
Inconclusive means the test didn't have enough power to distinguish the effect from zero. Options: run longer to increase power, accept that effects below your MDE aren't worth detecting, or redesign the test to focus on larger hypothesized effects.
Should I use a one-tailed or two-tailed test?
Two-tailed almost always. The occasions where you genuinely only care about one direction (and would be fine with any negative effect) are rare. Two-tailed is the safer default.
Methodology note: Statistical Debt framework and threshold patterns reflect experience across experimentation programs running 20-80 tests per quarter. Specific figures are presented as ranges. Frequentist statistical practice follows standard conventions documented in Kohavi, Tang, and Xu's "Trustworthy Online Controlled Experiments."
---
Pre-test and post-test discipline becomes much easier when you can see past tests' sample sizes and outcomes. Browse the GrowthLayer test library for reference patterns.
Related reading: