Experimentation has its own language. Statistical significance, p-values, null hypotheses, effect sizes — these terms get thrown around in meetings, documentation, and vendor pitches, often imprecisely. Misunderstanding them leads to bad decisions: stopping tests too early, implementing false positives, or dismissing valid results because the terminology was confusing.
This glossary defines every term you'll encounter in a serious experimentation program. Each definition emphasizes practical meaning — what the concept means for your business decisions, not just its statistical textbook definition.
Core Experiment Components
Control
The unchanged, original version that serves as your baseline for comparison. In an A/B test, the control is what your visitors currently experience. Its purpose is to answer the counterfactual question: "What would have happened if we'd changed nothing?" Without a control, you cannot attribute any difference in outcomes to the change you made — you'd be measuring seasonality, traffic mix shifts, and dozens of other variables alongside your change.
Variant (Treatment)
The modified version of the experience that embodies your hypothesis. In a standard A/B test, there is one variant. In an A/B/n test, there are multiple. The variant should differ from the control in exactly one way to maintain experimental clarity. When multiple elements change simultaneously, you lose the ability to determine which change drove the result.
Hypothesis
A testable prediction about what will happen and why. A proper hypothesis has three parts: the proposed change, the expected effect, and the reasoning. For example: "Changing the CTA from 'Learn More' to 'Start Free Trial' will increase sign-up rate by 10% because it reduces perceived commitment and makes the next step concrete." The hypothesis should be written before the test is designed. It serves as the foundation for calculating sample size, choosing metrics, and interpreting results.
Statistical Concepts
Statistical Significance
A result is statistically significant when the observed difference between control and variant is unlikely to have occurred by random chance alone. The threshold for "unlikely" is set by your significance level (alpha), typically 5%. If a result is significant at the 5% level, it means there's less than a 5% probability that you'd see a difference this large if the variant had no real effect.
Critical nuance: Statistical significance does not mean "important" or "meaningful." A 0.1% improvement can be statistically significant with enough traffic. Whether that improvement is worth implementing is a business decision, not a statistical one.
P-Value
The probability of observing a result as extreme as (or more extreme than) your test result, assuming the null hypothesis is true. In plain language: if your variant actually has no effect, how likely is it that you'd still see data this different from the control?
A p-value of 0.03 means that if the variant truly had no effect, you would see a difference this large only 3% of the time. It is not the probability that the result is due to chance, nor the probability that the null hypothesis is true — both are common misreadings. Most experiments use a significance threshold (alpha) of 0.05, meaning you declare significance when p < 0.05. A lower p-value provides stronger evidence against the null hypothesis, but it does not tell you the size of the effect or its practical importance.
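As a sketch of how a p-value is actually computed, here is a two-proportion z-test using only Python's standard library. The conversion counts (500 vs. 575 out of 5,000 visitors each) are hypothetical numbers chosen for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis (no real difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a z-statistic at least this extreme, in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical test: 10.0% vs 11.5% conversion on 5,000 visitors per arm
p = two_proportion_p_value(500, 5000, 575, 5000)
```

With these made-up numbers, p lands below 0.05, so the difference would be declared significant at the conventional threshold.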
Confidence Interval
A range of values within which the true effect of your variant plausibly falls, expressed with a confidence level (usually 95%). If the 95% confidence interval for your variant's effect on conversion rate is [+1.2%, +4.8%], the data are consistent with a true effect anywhere in that range. (Strictly, the 95% refers to the procedure: intervals constructed this way capture the true effect in 95% of repeated experiments.)
Confidence intervals are more informative than p-values alone because they convey both the estimated effect and the precision of that estimate. A narrow confidence interval ([+2.8%, +3.2%]) indicates a precise estimate. A wide one ([+0.1%, +6.5%]) indicates substantial uncertainty about the true effect size.
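A minimal sketch of a 95% interval for the difference between two conversion rates (a Wald interval, variant minus control), again using hypothetical counts:

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for the difference in conversion rates
    (variant minus control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # ~1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical test: 10.0% vs 11.5% conversion on 5,000 visitors per arm
lo, hi = diff_confidence_interval(500, 5000, 575, 5000)
```

Because both endpoints are positive here, the interval excludes zero — the same information a significant p-value conveys, plus a range for the size of the effect.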
Null Hypothesis
The default assumption that there is no difference between the control and variant — that any observed difference is due to random chance. Your A/B test is essentially asking whether the data provides sufficient evidence to reject this assumption.
Failing to reject the null hypothesis doesn't mean your variant has no effect. It means you didn't collect enough evidence to conclusively say it does. This distinction matters: "no evidence of effect" is not the same as "evidence of no effect."
Sample Size
The number of observations (typically visitors or sessions) needed in each group to detect a given effect size with a specified level of confidence and power. Sample size is calculated before the test begins and depends on four inputs: your baseline conversion rate, the minimum detectable effect (MDE), the significance level (alpha), and the desired statistical power.
Calculating sample size in advance is one of the most important steps in experiment design. It determines how long the test must run and prevents two common mistakes: stopping too early (leading to false positives) and running too long (wasting traffic on a test that's already conclusive).
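The standard normal-approximation formula ties the four inputs together. A sketch, with a hypothetical 10% baseline and a 10% relative MDE:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided,
    two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)  # rate implied by the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# 10% baseline, detect a 10% relative lift (10% -> 11%) at 80% power
n = sample_size_per_arm(0.10, 0.10)
```

Note how quickly the requirement grows as the MDE shrinks: halving the MDE roughly quadruples the required sample, because the effect appears squared in the denominator.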
Effect Size
The magnitude of the difference between the control and variant, typically expressed as a relative or absolute change in the primary metric. An absolute effect size of +0.5 percentage points on a 10% conversion rate translates to a relative effect size of +5%.
Minimum detectable effect (MDE) is the smallest effect size you want your test to be able to detect. Smaller MDEs require larger sample sizes. Setting the MDE involves a business judgment: what's the smallest improvement that would be worth implementing? If a 0.1% improvement isn't worth the engineering effort to ship, don't design your test to detect it.
Error Types
Type I Error (False Positive)
Concluding that the variant is different from the control when it actually isn't — rejecting the null hypothesis when it's true. The probability of a Type I error is equal to your significance level (alpha). At alpha = 0.05, you accept a 5% chance of a false positive.
The business cost of a Type I error is implementing a change that doesn't actually improve anything. At best, you waste engineering time. At worst, the "winning" variant actually had a different, unmeasured negative effect (like increasing short-term conversions but reducing long-term retention) that you attribute to success.
Type II Error (False Negative)
Failing to detect a real difference between the variant and control — failing to reject the null hypothesis when it's false. The probability of a Type II error is (1 - statistical power). With 80% power, you have a 20% chance of missing a real effect.
The business cost of a Type II error is abandoning a change that would have actually improved your metrics. This is often less visible than a false positive because you never implement the change, so you never see what you missed.
Statistical Power
The probability that your test will correctly detect a real effect of a given size. Conventionally set at 80%, meaning if the variant truly has the expected effect, there's an 80% chance the test will declare significance. Higher power (90%, 95%) requires more traffic but reduces the risk of Type II errors.
Power depends on three factors: sample size (more is better), effect size (larger effects are easier to detect), and significance level (a more lenient threshold increases power but also increases false positives). These three factors create a fundamental tradeoff that every experiment must navigate.
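The sample-size side of that tradeoff can be made concrete. This sketch approximates the power of a two-sided test for a hypothetical 10% → 11% lift at several sample sizes:

```python
from math import sqrt
from statistics import NormalDist

def power_of_test(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test comparing two conversion rates."""
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the observed effect clears the significance threshold
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# Power grows with sample size for the same 10% -> 11% effect
for n in (2_000, 8_000, 15_000):
    print(f"{n:>6} visitors/arm: power = {power_of_test(0.10, 0.11, n):.2f}")
```

At 2,000 visitors per arm this test is badly underpowered; it only reaches the conventional 80% near 15,000 per arm, which is why small-traffic sites struggle to detect modest effects.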
Validity and Reliability Concepts
Regression to the Mean
The statistical phenomenon where extreme measurements tend to be followed by measurements closer to the average. In A/B testing, this manifests as dramatic early results that moderate over time. A variant showing a 40% improvement after 100 visitors will almost certainly show a much smaller effect after 10,000 visitors.
Regression to the mean is not a flaw in your test — it's a mathematical certainty. It occurs because small samples are inherently noisier than large ones. The practical implication is that you should never make decisions based on early test results, no matter how compelling they look.
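The noise difference is easy to see in simulation. This sketch draws repeated samples from a variant whose true conversion rate is fixed at 10% and compares the scatter of small and large samples:

```python
import random

random.seed(42)
TRUE_RATE = 0.10  # the variant's actual conversion rate, by construction

def observed_rate(n_visitors):
    """Simulate n_visitors and return the observed conversion rate."""
    conversions = sum(random.random() < TRUE_RATE for _ in range(n_visitors))
    return conversions / n_visitors

# 20 simulated tests at each size: small samples scatter widely around 10%,
# large samples cluster tightly near it
small = [observed_rate(100) for _ in range(20)]
large = [observed_rate(10_000) for _ in range(20)]
```

Any of the 100-visitor samples that happened to land far from 10% would, if extended, drift back toward the true rate — that drift is regression to the mean, not the effect "fading."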
Novelty Effect
A temporary increase in engagement caused by a change being new, not better. When users encounter a visually different interface, they pay more attention, click more, and explore more — temporarily. Once the novelty wears off (usually within 1-3 weeks), engagement returns to its natural level, which may be no different from the control.
The novelty effect is particularly dangerous because it creates early positive signals that look like genuine improvements. Teams that stop tests early or run them for only a few days are especially vulnerable to implementing changes that showed a novelty-driven lift but have no lasting effect.
External Validity
The extent to which your test results generalize beyond the specific conditions of the experiment. A test run during Black Friday may not predict performance during a normal week. A test on your US traffic may not apply to your UK audience. A test on desktop visitors may not generalize to mobile.
External validity is the most commonly overlooked dimension of test quality. Teams obsess over statistical significance (internal validity) while assuming results generalize to all contexts (external validity). In reality, many test results are context-dependent, and shipping them universally can produce disappointing results.
Additional Important Terms
Conversion Rate
The percentage of visitors who complete a desired action (purchase, sign-up, click). Calculated as (number of conversions / number of visitors) * 100. This is the most common primary metric in A/B testing, though more sophisticated programs often use revenue per visitor or other metrics that capture both frequency and value.
Minimum Detectable Effect (MDE)
The smallest effect that your test is designed to detect with the specified power and significance level. A 5% MDE means your test can reliably detect effects of 5% or larger, but may miss smaller effects. Choosing the MDE is a business decision: it represents the smallest improvement worth knowing about, given the traffic and time investment required.
Peeking Problem
The practice of repeatedly checking test results before the predetermined sample size is reached, and stopping the test when results look significant. This is problematic because statistical significance is expected to fluctuate during a test. Checking at multiple points dramatically increases the false positive rate. A test designed for a 5% false positive rate can have an effective false positive rate of 20-30% or higher when checked repeatedly.
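The inflation is easy to demonstrate with an A/A simulation: control and variant are identical by construction, so every "significant" stop is a false positive. A sketch, peeking 20 times per test:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, ~1.96

def aa_test_with_peeking(n_per_arm=2_000, checks=20):
    """Run an A/A test (both arms convert at 10%) and peek repeatedly.
    Returns True if any peek looks 'significant' — a false positive,
    since there is no real difference to find."""
    conv_a = conv_b = 0
    step = n_per_arm // checks
    for i in range(1, n_per_arm + 1):
        conv_a += random.random() < 0.10
        conv_b += random.random() < 0.10
        if i % step == 0:  # peek at this interim point
            pool = (conv_a + conv_b) / (2 * i)
            se = sqrt(2 * pool * (1 - pool) / i)
            if se > 0 and abs(conv_b - conv_a) / i / se > Z_CRIT:
                return True  # stopped on noise
    return False

# Fraction of 500 A/A tests wrongly declared significant under peeking
rate = sum(aa_test_with_peeking() for _ in range(500)) / 500
```

A single look at the end would be wrong about 5% of the time; stopping at the first significant-looking peek pushes the false positive rate several times higher, matching the 20-30% figure above.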
Segmentation
Analyzing test results within subgroups of your population (mobile vs. desktop, new vs. returning visitors, geographic regions). Segmentation can reveal that an overall neutral result hides a strong positive effect in one segment and a negative effect in another. However, post-hoc segmentation (deciding which segments to analyze after seeing results) inflates false positive rates. Pre-registered segments avoid this problem.
Guardrail Metrics
Secondary metrics monitored during a test to ensure the variant doesn't cause unintended harm, even if the primary metric improves. For example, a test that increases sign-ups but decreases revenue per user would be caught by a revenue guardrail metric. Guardrail metrics don't determine the test winner — they prevent harmful implementations.
Using This Glossary
These terms aren't academic abstractions — they're the building blocks of sound experimentation practice. Each represents a concept that, when misunderstood, leads to measurable business harm: wasted engineering time implementing false positives, abandoned ideas that actually work, or decisions based on novelty effects that disappear within weeks.
Bookmark this glossary and reference it when reviewing test designs, interpreting results, or debating methodology with colleagues. The precision of your language directly affects the quality of your decisions.