Every A/B testing debate eventually lands here: Bayesian vs. frequentist. Statisticians write papers about it. Tool vendors use it for marketing. And most practitioners just want to know which one to pick.
I've run over 100 experiments across e-commerce and SaaS platforms, generating $30M+ in revenue impact. Here's what I actually think about this debate — including the part nobody tells you.
The Core Difference: What Number Are You Getting?
This is the most practical way to understand the split. Both approaches analyze the same experiment. They just give you a different number at the end.
Frequentist output: A p-value. This answers: "If there were truly no difference between A and B, how likely is it that I'd see results this extreme by chance?" A p-value of 0.03 means there's a 3% probability of observing this result (or more extreme) under the null hypothesis. Marketers translate this as "97% confident," which is technically wrong but functionally close enough for business decisions.
Bayesian output: A probability that B beats A. This answers: "Given what I've observed, how likely is it that variant B is actually better?" This is what most non-statisticians think they're getting from frequentist tests. "There's an 87% probability B is better than A" — that's intuitive. It's also genuinely what the Bayesian analysis calculates.
The difference matters most when communicating results. A p-value of 0.04 confuses stakeholders. "82% probability B beats A" does not.
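To make the two outputs concrete, here's a minimal pure-Python sketch that computes both numbers from the same data: a two-proportion z-test p-value (frequentist) and a Monte Carlo estimate of P(B beats A) under flat Beta(1,1) priors (Bayesian). The visitor and conversion counts are illustrative, not from a real test.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Frequentist: two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Bayesian: P(B > A) by sampling the Beta(1,1)-prior posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        > rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    )
    return wins / draws

# Same (made-up) data, two different numbers:
print(z_test_p_value(200, 5000, 245, 5000))   # frequentist p-value
print(prob_b_beats_a(200, 5000, 245, 5000))   # Bayesian P(B beats A)
```

Both functions see identical data; the p-value answers a question about the null hypothesis, while the posterior comparison answers the question stakeholders usually think they're asking.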
Frequentist Testing: When It's the Right Call
Frequentist A/B testing has been the default for decades. Tools like Google Optimize (RIP) and early Optimizely used it exclusively. Here's how it works in practice:
You set a significance threshold before the test (typically 95%). You calculate a required sample size. You run the test until you hit that sample size. You look at the p-value once — at the end. Binary result: significant or not.
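That required sample size comes from the standard normal-approximation formula for a two-proportion test. Here's a sketch with z-scores hard-coded for the common alpha/power choices; the 4% baseline and 10% relative lift below are illustrative numbers, not a recommendation.

```python
import math

def sample_size_per_variant(baseline_cr, mde_relative, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion test."""
    z_alpha = {0.05: 1.960, 0.01: 2.576}[alpha]  # z for two-sided alpha
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]   # z for the desired power
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)        # the CR you want to detect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. 4% baseline conversion rate, hoping to detect a 10% relative lift:
print(sample_size_per_variant(0.04, 0.10))
```

The formula makes the economics visible: sample size scales with the inverse square of the effect you want to detect, which is why small lifts on low-conversion pages need enormous traffic.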
When frequentist is the right choice:
- Regulated industries. Healthcare, finance, legal. Auditors understand p-values and statistical significance. Bayesian posterior probabilities are harder to defend in a compliance review.
- Large organizations with review boards. When results need sign-off from a statistics team or external reviewers, frequentist methodology is the shared language.
- Stakeholder defensibility. "We ran to statistical significance" is a sentence that most VPs of Product accept without further questioning.
- Fixed decision windows. When you need a hard yes/no at a specific date — say, before a product launch — the frequentist fixed-sample approach forces that discipline.
**Pro Tip:** If you're running experiments at a company where results ever go to legal, compliance, or an academic partner, use frequentist. The documentation trail is cleaner and the methodology is more universally understood.
Bayesian Testing: When It's the Right Call
Bayesian testing calculates the probability that one variant beats another, and it can do this continuously — no fixed sample size required. You can check it daily without mathematically compromising the result (more on this below).
When Bayesian is the right choice:
- Speed matters more than precision. Startups and small CRO teams that need decisions in 5 days, not 3 weeks.
- Low-traffic sites. When you never hit the sample sizes frequentist requires, Bayesian gives you a usable probability estimate rather than an inconclusive p-value.
- Iterative programs. If you're running 4+ tests per month, Bayesian's continuous monitoring fits the pace better.
- Intuitive stakeholder communication. "There's a 91% chance the new checkout beats the old one" lands better in a board meeting than "p = 0.04."
**Pro Tip:** Most Bayesian implementations let you set an expected-loss (or "risk") threshold: for example, stop when the expected cost of shipping the wrong variant drops below $500/week. This is more actionable than waiting for an arbitrary confidence level.
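As a sketch of what such a risk threshold computes: the expected loss of shipping B is E[max(p_A - p_B, 0)] under the posteriors, i.e. how much conversion rate you'd expect to give up, on average, if B is actually worse. Flat Beta(1,1) priors and all traffic/revenue figures below are assumptions for illustration.

```python
import random

def expected_loss_of_shipping_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Expected CVR given up by shipping B when A was actually better:
    E[max(p_A - p_B, 0)] under independent Beta(1,1)-prior posteriors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        total += max(p_a - p_b, 0.0)
    return total / draws

loss = expected_loss_of_shipping_b(200, 5000, 245, 5000)
# Hypothetical business translation: weekly traffic and revenue per conversion
weekly_visitors, revenue_per_conversion = 50_000, 80
print(f"expected CVR loss: {loss:.5f} "
      f"(~${loss * weekly_visitors * revenue_per_conversion:,.0f}/week at risk)")
```

When that weekly dollar risk falls below your tolerance, shipping B is cheap insurance even if there's residual uncertainty.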
Sequential Testing: The Middle Ground You're Actually Using
Here's the part most articles skip: if you're using Optimizely, you're probably using neither pure frequentist nor pure Bayesian. You're using sequential testing, specifically Optimizely's Stats Engine.
Sequential testing is a frequentist method that corrects for the peeking problem mathematically, allowing you to look at results continuously without inflating your false positive rate. It uses a technique called "always valid inference" — the p-value stays valid regardless of when you check it.
This is different from Bayesian. It's still producing a p-value and a confidence interval. But it behaves more like Bayesian in practice because you can check the dashboard at any time without the "you're cheating" objection.
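To make "always valid inference" less abstract, here's a toy version of the mixture sequential probability ratio test (mSPRT) that underlies it, for a stream of roughly normal observations with known variance. This is a teaching sketch, not Optimizely's actual implementation; the mixing-prior scale `tau` is a tuning choice.

```python
import math
import random

def always_valid_p_values(xs, sigma=1.0, tau=1.0):
    """Toy mSPRT: always-valid p-values for H0: mean = 0 over a stream xs.

    Mixture likelihood ratio with a N(0, tau^2) prior on the effect size;
    the running minimum of 1/Lambda_n is a valid p-value at every n.
    """
    p, total, pvals = 1.0, 0.0, []
    for n, x in enumerate(xs, start=1):
        total += x
        mean = total / n
        v = sigma ** 2 + n * tau ** 2
        log_lr = (0.5 * math.log(sigma ** 2 / v)
                  + (n ** 2 * mean ** 2 * tau ** 2) / (2 * sigma ** 2 * v))
        p = min(p, 1.0 / math.exp(log_lr))  # can only tighten over time
        pvals.append(p)
    return pvals

# With a real shift in the stream, p eventually drops; with no shift it
# (usually) stays high, and you may peek at any index without penalty.
rng = random.Random(1)
stream = [rng.gauss(0.5, 1.0) for _ in range(200)]
print(always_valid_p_values(stream)[-1])
```

The key property is in the `min`: the running p-value never "un-rejects," so stopping the moment it crosses your threshold is legitimate, at any sample size.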
The practical implication: If you're using Optimizely's Stats Engine, you don't need to choose between Bayesian and frequentist. The engine has already made that choice for you in a way that combines practical advantages of both.
**Pro Tip:** When someone asks "are you using Bayesian or frequentist?", the correct answer for Optimizely users is "sequential frequentist." This matters if you're ever defending your methodology to a statistics-aware stakeholder.
The Peeking Problem: The Real Reason This Debate Exists
Peeking is looking at your test results before you've hit your predetermined sample size and making a stopping decision based on what you see.
In classical frequentist statistics, peeking destroys your analysis. Here's why with numbers:
Suppose your test requires 20,000 visitors per variation for 95% significance at 80% power. Instead of waiting, you check every 1,000 visitors and stop the first time you see p < 0.05. The problem: run that scenario repeatedly on a true null (no actual difference) and you'll declare "significance" roughly 25% of the time, not the 5% your alpha level was supposed to guarantee. You've inflated your false positive rate by 5x.
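You can verify the inflation yourself. This simulation runs repeated A/A tests (a true null), peeks at a z-test every few hundred visitors, and stops at the first p < 0.05. The traffic numbers are scaled down from the example above purely to keep the simulation fast.

```python
import math
import random

def peeking_false_positive_rate(n_max=2000, peek_every=200, trials=400, seed=7):
    """Simulate A/A tests (no real difference) with early stopping at each peek."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        ca = cb = 0  # conversion counts; true rate is 5% in both arms
        for n in range(1, n_max + 1):
            ca += rng.random() < 0.05
            cb += rng.random() < 0.05
            if n % peek_every == 0:
                pool = (ca + cb) / (2 * n)
                if pool in (0.0, 1.0):
                    continue  # degenerate peek, no test possible
                se = math.sqrt(pool * (1 - pool) * 2 / n)
                p = math.erfc(abs(cb / n - ca / n) / se / math.sqrt(2))
                if p < 0.05:
                    false_positives += 1  # we'd have (wrongly) called a winner
                    break
    return false_positives / trials

print(peeking_false_positive_rate())  # well above the nominal 5%
```

Every "winner" this simulation finds is noise, because both arms have the identical 5% conversion rate by construction.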
How each approach handles peeking:
- Classical frequentist: You can't peek. At all. This is why so many frequentist-run programs produce garbage data — nobody actually follows the rule.
- Bayesian: The posterior probability is always a valid statement of current evidence. There's no mathematical inflation from checking. You're just getting an updated estimate each time.
- Sequential (Optimizely Stats Engine): Designed explicitly for peeking. The confidence intervals are wider early in the experiment and tighten as data accumulates. You can stop at any point and the inference remains valid.
**Pro Tip:** If your team is running frequentist tests and checking results every day, your false positive rate is significantly higher than your stated confidence level. Either switch to sequential testing, enforce the no-peeking rule, or accept that some of your "winners" aren't.
Communicating Results to Non-Technical PMs
This is where the rubber meets the road. You need to translate your stats output into something that informs a shipping decision.
Frequentist translation:
- p = 0.03, 95% confidence interval: [+1.2%, +4.8%] on CVR
- PM-friendly: "We're 95% confident the effect is real, and the lift is plausibly somewhere between 1.2 and 4.8 percentage points."
- Decision: Ship if +1.2% lift is worth the development cost.
Bayesian translation:
- 89% probability B beats A, expected lift +2.1%, risk of being wrong: -0.4% CVR
- PM-friendly: "There's an 89% chance this is a real improvement. If we're wrong, we'd expect to lose about 0.4 percentage points."
- Decision: Ship if the risk-adjusted expected value is positive.
Bayesian is genuinely easier to act on. The risk quantification ("we'd lose 0.4pp if wrong") is something a product team can actually use in a go/no-go decision.
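One simple way to fold those three numbers into a single go/no-go figure, treating the lift and loss as conditional estimates (a simplification, and the dollar conversion below is a made-up assumption):

```python
def risk_adjusted_ev_pp(p_b_wins, lift_if_right_pp, loss_if_wrong_pp):
    """Risk-adjusted expected CVR change (percentage points) from shipping B."""
    return p_b_wins * lift_if_right_pp - (1 - p_b_wins) * loss_if_wrong_pp

# Numbers from the Bayesian example above; the dollars-per-CVR-point figure
# is a hypothetical assumption for illustration.
ev_pp = risk_adjusted_ev_pp(0.89, 2.1, 0.4)
dollars_per_pp_per_week = 12_000
print(f"risk-adjusted lift: {ev_pp:+.2f} pp "
      f"(~${ev_pp * dollars_per_pp_per_week:,.0f}/week)")
```

A positive risk-adjusted value supports shipping; a negative one says the downside outweighs the probable upside, even before factoring in development cost.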
Decision Table: Which Approach to Use
| Scenario | Recommended Approach |
|---|---|
| Regulated industry (healthcare, finance) | Frequentist |
| Startup, <50K monthly sessions | Bayesian |
| Using Optimizely's Stats Engine | Sequential (already configured) |
| Need to defend results to external auditors | Frequentist |
| Running 6+ tests/month, fast iteration | Bayesian or Sequential |
**Pro Tip:** For teams just starting out, use whatever your testing platform defaults to. Getting the methodology right matters far less than running well-designed tests. Fight the statistics battle after you've built the habit of testing.
Common Mistakes
Mistake 1: Calling frequentist tests "Bayesian" because you check them daily. Checking a frequentist test daily and stopping when it looks good is just peeking with extra steps. Sequential testing is what enables valid daily checking.
Mistake 2: Using Bayesian to justify stopping tests after 3 days. Bayesian handles the statistical validity of early stopping. It doesn't handle regression to the mean, day-of-week effects, or novelty bias. You still need minimum duration requirements.
Mistake 3: Treating the Bayesian probability as a guarantee. "95% probability B beats A" still means there's a 5% chance you're making the wrong call. Make 40 such calls and you should expect about 2 false positives on average. Plan for it.
Mistake 4: Re-running a failed frequentist test with Bayesian to "get a win." Different statistical framework, same data, different-looking output. This is methodological shopping, not improved analysis.
What to Do Next
If you're running experiments on Optimizely, you're already using sequential testing — read the Optimizely Practitioner Toolkit for the full setup guide including how to interpret Stats Engine results.
If you're evaluating testing platforms and the statistical approach matters to your organization, the decision table above should guide you. Most teams running fewer than 10 tests per month won't notice a practical difference between approaches. Most teams running 20+ tests per month will benefit from Bayesian or sequential.
The most important thing isn't which statistics framework you choose. It's whether you're actually following its rules.