Your test reached full sample size. The dashboard shows green. The variant won. Time to ship it, right?

Not yet. The dashboard green light is the beginning of analysis, not the end. A proper analysis goes beyond the topline result to understand what actually happened, who it happened to, and whether you should trust it.

I have seen teams ship "winning" variants that produced no measurable lift in production. The test said +12%. Reality said +0%. The gap was always in the analysis — or rather, in the analysis they skipped.

This guide covers how to properly analyze A/B test results, including segmentation, effect size interpretation, guardrail metric review, and the honest reporting practices that build long-term credibility for your experimentation program.

Before You Start: The Analysis Protocol

Analysis is the final stage of the experimentation process, but your analysis plan should have been written before the test launched. If you are deciding what to measure after seeing the results, you are not analyzing — you are fishing for significance.

Your pre-registered analysis plan should include: the primary metric and decision criteria, guardrail metrics and their thresholds, planned segments for secondary analysis, and the statistical method you are using to evaluate results.

With that foundation in place, here is the step-by-step analysis workflow.

Step 1: Validate the Data

Before looking at any results, validate that the test ran correctly. Check these items:

  • Sample ratio mismatch (SRM). If you set a 50/50 split, the actual split should be close to 50/50. A 52/48 split on 100,000 users is suspicious. Run a chi-squared test on the allocation — if it is statistically significant, your randomization was compromised and the test results are unreliable.
  • Data completeness. Check for gaps in data collection. Did tracking drop out for any period? Were there server outages that affected only one variation? Missing data biases results in unpredictable ways.
  • Sample size reached. Confirm that the test reached the pre-calculated sample size. If it did not, the results are underpowered and you should not draw conclusions from them.
  • Full-week cycles. Verify the test ran for complete weeks to avoid day-of-week bias. A test that ran Monday through Friday will overrepresent weekday behavior.

If any of these checks fail, stop. Do not proceed to interpretation. Fix the issue or acknowledge the limitation before drawing any conclusions.
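The SRM check in particular is easy to automate. Here is a minimal stdlib-only sketch of the two-arm chi-squared goodness-of-fit test described above (the function name and the alpha threshold are illustrative choices, not a standard API):

```python
import math

def srm_check(n_control, n_variant, expected_ratio=0.5, alpha=0.001):
    """Chi-squared goodness-of-fit test for a two-arm sample ratio mismatch.

    Returns (p_value, mismatch_flag). A tiny p-value means the observed
    split is very unlikely under the intended allocation, so the
    randomization was probably compromised.
    """
    total = n_control + n_variant
    exp_c = total * expected_ratio
    exp_v = total * (1 - expected_ratio)
    stat = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2)), so no stats library is needed.
    p_value = math.erfc(math.sqrt(stat / 2))
    return p_value, p_value < alpha

# The article's example: intended 50/50, observed 52/48 on 100,000 users
p, mismatch = srm_check(52_000, 48_000)
```

A 52/48 split on 100,000 users fails this check decisively, confirming the article's intuition that it is not just "a bit off" — it is evidence of broken allocation.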

Step 2: Evaluate the Primary Metric

Look at three things for your primary metric: the point estimate, the confidence interval, and the practical significance.

Point Estimate

The point estimate is your best guess of the true effect size. If the variant converted at 4.1% versus the control at 3.8%, the point estimate of the absolute lift is 0.3 percentage points (about 7.9% relative lift).

Confidence Interval

The confidence interval tells you the range of plausible true effect sizes. A 95% confidence interval of [0.1%, 0.5%] means the true lift is probably somewhere between 0.1 and 0.5 percentage points. The interval matters more than the point estimate because it shows you the uncertainty.

A narrow confidence interval means you have a precise estimate. A wide interval means you need more data or should have run the test longer.
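Both the point estimate and the interval fall out of a standard normal-approximation calculation for a difference in proportions. A sketch using the article's 3.8% vs. 4.1% figures, with a hypothetical 50,000 users per arm (the sample size is an assumption for illustration):

```python
import math

def lift_confidence_interval(conv_c, n_c, conv_v, n_v, z=1.96):
    """95% normal-approximation CI for the absolute lift (variant minus
    control conversion rate). Returns (lower_bound, upper_bound)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    diff = p_v - p_c
    return diff - z * se, diff + z * se

# Control: 1,900 of 50,000 (3.8%); variant: 2,050 of 50,000 (4.1%)
low, high = lift_confidence_interval(1_900, 50_000, 2_050, 50_000)
```

With these inputs the interval is roughly [0.06, 0.54] percentage points around the 0.3-point estimate — wide enough that the true effect could plausibly be a fifth of the point estimate, which is exactly why the interval matters more than the single number.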

Practical Significance

Statistical significance tells you the effect is probably real. Practical significance tells you the effect is worth caring about. A 0.02% conversion rate improvement might be statistically significant with enough data, but it is not worth the engineering effort to implement.

Always translate the effect size into business terms. What does a 0.3 percentage point lift on your pricing page mean in annual revenue? If the answer is $50,000, ship it. If the answer is $500, question whether there are better uses of your implementation capacity.
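The translation to business terms is simple arithmetic, and worth scripting so every test report uses the same formula. In this sketch, all inputs except the 0.3-point lift from the example above are hypothetical:

```python
def annual_revenue_impact(monthly_visitors, lift_pp, revenue_per_conversion):
    """Translate an absolute conversion lift (in percentage points)
    into an estimated annual revenue figure."""
    extra_conversions_per_month = monthly_visitors * (lift_pp / 100)
    return extra_conversions_per_month * revenue_per_conversion * 12

# Hypothetical pricing page: 10,000 monthly visitors, 0.3 pp lift,
# $140 average revenue per conversion
impact = annual_revenue_impact(10_000, 0.3, 140)  # roughly $50,400 a year
```

At 10,000 monthly visitors this lift clears the article's $50,000 bar; at 100 visitors the same lift is pocket change. The effect size alone never answers the "is it worth shipping?" question — traffic and revenue per conversion do.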

Step 3: Check Guardrail Metrics

A variant that wins on the primary metric but degrades guardrails is not a winner. Check every guardrail metric you defined before the test.

Common guardrail failures include:

  • Increased bounce rate (users leave faster despite the "winning" metric)
  • Decreased downstream conversion (more clicks but fewer purchases)
  • Increased page load time (variant is slower, causing a hidden selection bias)
  • Higher support ticket volume (the change confused users)

If a guardrail shows significant degradation, the test is not a clear win regardless of what the primary metric says. You need a deeper investigation or a decision about acceptable tradeoffs.

Step 4: Segment the Results

Topline results hide important variation. A test with a +5% average lift might be +15% for mobile users and -3% for desktop users. Segmentation analysis reveals these heterogeneous effects and often changes the implementation decision.

Pre-Planned Segments

These are the segments you defined before the test. They have full statistical validity. Common pre-planned segments include:

  • Device type (mobile vs. desktop)
  • Traffic source (organic vs. paid vs. direct)
  • New vs. returning visitors
  • Geographic region
  • Customer tier or plan type

Exploratory Segments

Exploratory segmentation is looking at cuts you did not plan in advance. This is fine for generating hypotheses for future tests, but treat the results with caution. When you check 20 segments, you expect one to show a "significant" result by chance alone.

The rule is simple: pre-planned segments inform decisions. Exploratory segments inform future tests. Confusing the two is one of the most common validity threats in experimentation programs.
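The "20 segments" intuition is easy to quantify. At a 5% significance level, the chance of at least one false positive across independent segment checks grows quickly, which is why exploratory cuts need either a corrected threshold or a confirmation test. A quick sketch:

```python
def family_wise_error_rate(n_segments, alpha=0.05):
    """Probability of at least one false positive across n independent
    segment checks, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_segments

def bonferroni_alpha(n_segments, alpha=0.05):
    """Bonferroni-corrected per-segment threshold that holds the
    family-wise error rate at roughly alpha."""
    return alpha / n_segments

fwer = family_wise_error_rate(20)   # ~0.64: a false positive is more
                                    # likely than not across 20 segments
threshold = bonferroni_alpha(20)    # 0.0025 per segment
```

A Bonferroni correction is conservative, but even this crude adjustment shows how much stricter the bar must be before an exploratory segment result deserves to be treated as anything more than a hypothesis.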

Step 5: Interpret the Effect Size

A common trap is interpreting the point estimate as the guaranteed outcome. If your test shows a +8% lift, do not promise stakeholders +8% in production. The true effect could be anywhere in the confidence interval.

Be conservative. Report the lower bound of the confidence interval as the "worst case" and the point estimate as the "expected case." This builds credibility: stakeholders who consistently see results meet or beat your forecasts come to trust the testing program, while stakeholders who consistently see results fall short of promises lose faith.

Also watch for regression to the mean. Test results measured during the experiment often overestimate the long-term effect. This is because the experiment captures a specific time window that may not represent average conditions.

Handling Inconclusive Results

Not every test produces a clear winner. Many tests are inconclusive — the difference between variants is not statistically significant. This is not a failure. It is a valid result that tells you the change does not have a detectable impact at the effect size you cared about.

When a test is inconclusive:

  • Do not extend the test hoping for significance. This is p-value hacking. If you need more data, calculate a new sample size and run a new test.
  • Do not call it a failure. An inconclusive test still taught you that the change does not have a large effect. That is useful information.
  • Document it thoroughly. Future team members need to know this was tested and what the result was, so they do not waste time re-running the same experiment.

The Analysis Report

Every test should produce an analysis report that gets added to your test archives. This is how you compound institutional knowledge over time. The report should include:

  1. Test summary — hypothesis, what was tested, test duration
  2. Results — primary metric outcome with confidence interval, guardrail metrics, key segments
  3. Decision — ship, do not ship, or iterate. Include the reasoning.
  4. Learnings — what did this test teach you about user behavior, regardless of the outcome?
  5. Follow-up actions — future tests suggested by the results, questions raised for further investigation

Honest Reporting

The fastest way to destroy an experimentation program is to oversell results. The second fastest way is to hide negative results.

Report honestly. Show the confidence intervals, not just the point estimate. Highlight guardrail concerns. Present inconclusive tests with the same rigor as winners. When a test loses, explain what you learned and how it informs the next experiment.

The organizations with the strongest experimentation cultures are the ones that celebrate learning from losses as much as they celebrate wins. If your culture only rewards winners, people will find ways to manufacture them — and your program will produce impressive-looking reports backed by unreliable data.

Pro Tip: The Post-Implementation Check

After you ship a winning variant, monitor the production metric for two to four weeks. Compare the actual lift to the expected lift from the test. This post-implementation check serves two purposes.

First, it catches implementation errors. If the test showed +8% but the production metric is flat, something went wrong in the implementation — the winning variant was not deployed correctly, or the test environment differed from production in ways that mattered.

Second, it calibrates your testing program. If test results consistently overestimate production impact, you have a systematic bias (often caused by novelty effects or seasonal alignment). Track the ratio of predicted lift to actual lift over time and use it to adjust future projections.
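Tracking that ratio can be as simple as a running log of (predicted, actual) pairs. A sketch with hypothetical numbers — the data, function names, and discounting approach are illustrative, not a standard method:

```python
def calibration_ratio(history):
    """Average ratio of actual production lift to predicted (test) lift
    across past shipped experiments. Pairs with a zero prediction are
    skipped to avoid division by zero."""
    ratios = [actual / predicted for predicted, actual in history if predicted]
    return sum(ratios) / len(ratios)

# Hypothetical log of (predicted lift, actual lift) from past ships
history = [(0.08, 0.05), (0.04, 0.03), (0.10, 0.07)]
factor = calibration_ratio(history)  # ~0.69: tests overestimate by ~30%

def adjusted_forecast(predicted_lift, factor):
    """Discount a new test's predicted lift by the historical factor."""
    return predicted_lift * factor
```

With a factor around 0.69, a new test showing +8% would be forecast to stakeholders as roughly +5.5% in production — which is exactly the kind of conservative, credibility-building estimate the reporting section argues for.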

