Every A/B testing debate eventually lands here: Bayesian vs. frequentist. Statisticians write papers about it. Tool vendors use it for marketing. And most practitioners just want to know which one to pick.
I've run over 100 experiments across e-commerce and SaaS platforms, generating $30M+ in revenue impact. Here's what I actually think about this debate — including the part nobody tells you.
The Core Difference: What Number Are You Getting?
This is the most practical way to understand the split. Both approaches analyze the same experiment. They just give you a different number at the end.
Frequentist output: A p-value. This answers: "If there were truly no difference between A and B, how likely is it that I'd see results this extreme by chance?" A p-value of 0.03 means there's a 3% probability of observing this result (or more extreme) under the null hypothesis. Marketers translate this as "97% confident," which is technically wrong but functionally close enough for business decisions.
Bayesian output: A probability that B beats A. This answers: "Given what I've observed, how likely is it that variant B is actually better?" This is what most non-statisticians think they're getting from frequentist tests. "There's an 87% probability B is better than A" — that's intuitive. It's also genuinely what the Bayesian analysis calculates.
The difference matters most when communicating results. A p-value of 0.04 confuses stakeholders. "82% probability B beats A" does not.
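To make the two outputs concrete, here's a minimal pure-Python sketch that computes both numbers from the same data: a two-proportion z-test p-value (frequentist) and a Monte Carlo estimate of P(B beats A) under flat Beta(1,1) priors (Bayesian). The visitor and conversion counts are illustrative, not from a real test.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Frequentist: two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Bayesian: P(B > A) by sampling the Beta(1,1)-prior posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        > rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    )
    return wins / draws

# Same (made-up) data, two different numbers:
print(z_test_p_value(200, 5000, 245, 5000))   # frequentist p-value
print(prob_b_beats_a(200, 5000, 245, 5000))   # Bayesian P(B beats A)
```

Both functions see identical data; the p-value answers a question about the null hypothesis, while the posterior comparison answers the question stakeholders usually think they're asking.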
Frequentist Testing: When It's the Right Call
Frequentist A/B testing has been the default for decades. Tools like Google Optimize (RIP) and early Optimizely used it exclusively. Here's how it works in practice:
You set a significance threshold before the test (typically 95%). You calculate a required sample size. You run the test until you hit that sample size. You look at the p-value once — at the end. Binary result: significant or not.
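That required sample size comes from the standard normal-approximation formula for a two-proportion test. Here's a sketch with z-scores hard-coded for the common alpha/power choices; the 4% baseline and 10% relative lift below are illustrative numbers, not a recommendation.

```python
import math

def sample_size_per_variant(baseline_cr, mde_relative, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion test."""
    z_alpha = {0.05: 1.960, 0.01: 2.576}[alpha]  # z for two-sided alpha
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]   # z for the desired power
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)        # the CR you want to detect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. 4% baseline conversion rate, hoping to detect a 10% relative lift:
print(sample_size_per_variant(0.04, 0.10))
```

The formula makes the economics visible: sample size scales with the inverse square of the effect you want to detect, which is why small lifts on low-conversion pages need enormous traffic.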
When frequentist is the right choice:
- Regulated industries. Healthcare, finance, legal. Auditors understand p-values and statistical significance. Bayesian posterior probabilities are harder to defend in a compliance review.
- Large organizations with review boards. When results need sign-off from a statistics team or external reviewers, frequentist methodology is the shared language.
- Stakeholder defensibility. "We ran to statistical significance" is a sentence that most VPs of Product accept without further questioning.
- Fixed decision windows. When you need a hard yes/no at a specific date — say, before a product launch — the frequentist fixed-sample approach forces that discipline.
**Pro Tip:** If you're running experiments at a company where results ever go to legal, compliance, or an academic partner, use frequentist. The documentation trail is cleaner and the methodology is more universally understood.
Bayesian Testing: When It's the Right Call
Bayesian testing calculates the probability that one variant beats another, and it can do this continuously — no fixed sample size required. You can check it daily without mathematically compromising the result (more on this below).
When Bayesian is the right choice:
- Speed matters more than precision. Startups and small CRO teams that need decisions in 5 days, not 3 weeks.
- Low-traffic sites. When you never hit the sample sizes frequentist requires, Bayesian gives you a usable probability estimate rather than an inconclusive p-value.
- Iterative programs. If you're running 4+ tests per month, Bayesian's continuous monitoring fits the pace better.
- Intuitive stakeholder communication. "There's a 91% chance the new checkout beats the old one" lands better in a board meeting than "p = 0.04."
**Pro Tip:** Most Bayesian implementations let you set an expected-loss (or "risk") threshold: for example, stop when the expected cost of shipping the wrong variant drops below $500/week. This is more actionable than waiting for an arbitrary confidence level.
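As a sketch of what such a risk threshold computes: the expected loss of shipping B is E[max(p_A - p_B, 0)] under the posteriors, i.e. how much conversion rate you'd expect to give up, on average, if B is actually worse. Flat Beta(1,1) priors and all traffic/revenue figures below are assumptions for illustration.

```python
import random

def expected_loss_of_shipping_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Expected CVR given up by shipping B when A was actually better:
    E[max(p_A - p_B, 0)] under independent Beta(1,1)-prior posteriors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        total += max(p_a - p_b, 0.0)
    return total / draws

loss = expected_loss_of_shipping_b(200, 5000, 245, 5000)
# Hypothetical business translation: weekly traffic and revenue per conversion
weekly_visitors, revenue_per_conversion = 50_000, 80
print(f"expected CVR loss: {loss:.5f} "
      f"(~${loss * weekly_visitors * revenue_per_conversion:,.0f}/week at risk)")
```

When that weekly dollar risk falls below your tolerance, shipping B is cheap insurance even if there's residual uncertainty.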
Sequential Testing: The Middle Ground You're Actually Using
Here's the part most articles skip: if you're using Optimizely, you're probably using neither pure frequentist nor pure Bayesian. You're using sequential testing, specifically Optimizely's Stats Engine.
Sequential testing is a frequentist method that corrects for the peeking problem mathematically, allowing you to look at results continuously without inflating your false positive rate. It uses a technique called "always valid inference" — the p-value stays valid regardless of when you check it.
This is different from Bayesian. It's still producing a p-value and a confidence interval. But it behaves more like Bayesian in practice because you can check the dashboard at any time without the "you're cheating" objection.
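To make "always valid inference" less abstract, here's a toy version of the mixture sequential probability ratio test (mSPRT) that underlies it, for a stream of roughly normal observations with known variance. This is a teaching sketch, not Optimizely's actual implementation; the mixing-prior scale `tau` is a tuning choice.

```python
import math
import random

def always_valid_p_values(xs, sigma=1.0, tau=1.0):
    """Toy mSPRT: always-valid p-values for H0: mean = 0 over a stream xs.

    Mixture likelihood ratio with a N(0, tau^2) prior on the effect size;
    the running minimum of 1/Lambda_n is a valid p-value at every n.
    """
    p, total, pvals = 1.0, 0.0, []
    for n, x in enumerate(xs, start=1):
        total += x
        mean = total / n
        v = sigma ** 2 + n * tau ** 2
        log_lr = (0.5 * math.log(sigma ** 2 / v)
                  + (n ** 2 * mean ** 2 * tau ** 2) / (2 * sigma ** 2 * v))
        p = min(p, 1.0 / math.exp(log_lr))  # can only tighten over time
        pvals.append(p)
    return pvals

# With a real shift in the stream, p eventually drops; with no shift it
# (usually) stays high, and you may peek at any index without penalty.
rng = random.Random(1)
stream = [rng.gauss(0.5, 1.0) for _ in range(200)]
print(always_valid_p_values(stream)[-1])
```

The key property is in the `min`: the running p-value never "un-rejects," so stopping the moment it crosses your threshold is legitimate, at any sample size.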
The practical implication: If you're using Optimizely's Stats Engine, you don't need to choose between Bayesian and frequentist. The engine has already made that choice for you in a way that combines practical advantages of both.
**Pro Tip:** When someone asks "are you using Bayesian or frequentist?", the correct answer for Optimizely users is "sequential frequentist." This matters if you're ever defending your methodology to a statistics-aware stakeholder.
The Peeking Problem: The Real Reason This Debate Exists
Peeking is looking at your test results before you've hit your predetermined sample size and making a stopping decision based on what you see.
In classical frequentist statistics, peeking destroys your analysis. Here's why with numbers:
Suppose your test requires 20,000 visitors per variation for 95% significance at 80% power. Instead of waiting, you check every 1,000 visitors and stop the first time you see p < 0.05. The problem: run that scenario repeatedly on a true null (no actual difference) and you'll declare "significance" roughly 25% of the time, not the 5% your alpha level was supposed to guarantee. You've inflated your false positive rate by 5x.
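You can verify the inflation yourself. This simulation runs repeated A/A tests (a true null), peeks at a z-test every few hundred visitors, and stops at the first p < 0.05. The traffic numbers are scaled down from the example above purely to keep the simulation fast.

```python
import math
import random

def peeking_false_positive_rate(n_max=2000, peek_every=200, trials=400, seed=7):
    """Simulate A/A tests (no real difference) with early stopping at each peek."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        ca = cb = 0  # conversion counts; true rate is 5% in both arms
        for n in range(1, n_max + 1):
            ca += rng.random() < 0.05
            cb += rng.random() < 0.05
            if n % peek_every == 0:
                pool = (ca + cb) / (2 * n)
                if pool in (0.0, 1.0):
                    continue  # degenerate peek, no test possible
                se = math.sqrt(pool * (1 - pool) * 2 / n)
                p = math.erfc(abs(cb / n - ca / n) / se / math.sqrt(2))
                if p < 0.05:
                    false_positives += 1  # we'd have (wrongly) called a winner
                    break
    return false_positives / trials

print(peeking_false_positive_rate())  # well above the nominal 5%
```

Every "winner" this simulation finds is noise, because both arms have the identical 5% conversion rate by construction.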
How each approach handles peeking:
- Classical frequentist: You can't peek. At all. This is why so many frequentist-run programs produce garbage data — nobody actually follows the rule.
- Bayesian: The posterior probability is always a valid statement of current evidence. There's no mathematical inflation from checking. You're just getting an updated estimate each time.
- Sequential (Optimizely Stats Engine): Designed explicitly for peeking. The confidence intervals are wider early in the experiment and tighten as data accumulates. You can stop at any point and the inference remains valid.
**Pro Tip:** If your team is running frequentist tests and checking results every day, your false positive rate is significantly higher than your stated confidence level. Either switch to sequential testing, enforce the no-peeking rule, or accept that some of your "winners" aren't.
Communicating Results to Non-Technical PMs
This is where the rubber meets the road. You need to translate your stats output into something that informs a shipping decision.
Frequentist translation:
- p = 0.03, 95% confidence interval: [+1.2%, +4.8%] on CVR
- PM-friendly: "We're 95% confident the effect is real, and the lift is plausibly somewhere between 1.2 and 4.8 percentage points."
- Decision: Ship if +1.2% lift is worth the development cost.
Bayesian translation:
- 89% probability B beats A, expected lift +2.1%, risk of being wrong: -0.4% CVR
- PM-friendly: "There's an 89% chance this is a real improvement. If we're wrong, we'd expect to lose about 0.4 percentage points."
- Decision: Ship if the risk-adjusted expected value is positive.
Bayesian is genuinely easier to act on. The risk quantification ("we'd lose 0.4pp if wrong") is something a product team can actually use in a go/no-go decision.
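One simple way to fold those three numbers into a single go/no-go figure, treating the lift and loss as conditional estimates (a simplification, and the dollar conversion below is a made-up assumption):

```python
def risk_adjusted_ev_pp(p_b_wins, lift_if_right_pp, loss_if_wrong_pp):
    """Risk-adjusted expected CVR change (percentage points) from shipping B."""
    return p_b_wins * lift_if_right_pp - (1 - p_b_wins) * loss_if_wrong_pp

# Numbers from the Bayesian example above; the dollars-per-CVR-point figure
# is a hypothetical assumption for illustration.
ev_pp = risk_adjusted_ev_pp(0.89, 2.1, 0.4)
dollars_per_pp_per_week = 12_000
print(f"risk-adjusted lift: {ev_pp:+.2f} pp "
      f"(~${ev_pp * dollars_per_pp_per_week:,.0f}/week)")
```

A positive risk-adjusted value supports shipping; a negative one says the downside outweighs the probable upside, even before factoring in development cost.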
Decision Table: Which Approach to Use
| Scenario | Recommended Approach |
|---|---|
| Regulated industry (healthcare, finance) | Frequentist |
| Startup, <50K monthly sessions | Bayesian |
| Using Optimizely's Stats Engine | Sequential (already configured) |
| Need to defend results to external auditors | Frequentist |
| Running 6+ tests/month, fast iteration | Bayesian or Sequential |
**Pro Tip:** For teams just starting out, use whatever your testing platform defaults to. Getting the methodology right matters far less than running well-designed tests. Fight the statistics battle after you've built the habit of testing.
Common Mistakes
Mistake 1: Calling frequentist tests "Bayesian" because you check them daily. Checking a frequentist test daily and stopping when it looks good is just peeking with extra steps. Sequential testing is what enables valid daily checking.
Mistake 2: Using Bayesian to justify stopping tests after 3 days. Bayesian handles the statistical validity of early stopping. It doesn't handle regression to the mean, day-of-week effects, or novelty bias. You still need minimum duration requirements.
Mistake 3: Treating the Bayesian probability as a guarantee. "95% probability B beats A" still means there's a 5% chance you're making the wrong call. Make 40 such calls and you should expect about 2 false positives on average. Plan for it.
Mistake 4: Re-running a failed frequentist test with Bayesian to "get a win." Different statistical framework, same data, different-looking output. This is methodological shopping, not improved analysis.
What to Do Next
If you're running experiments on Optimizely, you're already using sequential testing — read the Optimizely Practitioner Toolkit for the full setup guide including how to interpret Stats Engine results.
If you're evaluating testing platforms and the statistical approach matters to your organization, the decision table above should guide you. Most teams running fewer than 10 tests per month won't notice a practical difference between approaches. Most teams running 20+ tests per month will benefit from Bayesian or sequential.
The most important thing isn't which statistics framework you choose. It's whether you're actually following its rules.