p-value
The probability of observing results at least as extreme as the actual results, assuming there is no real difference between the variations.
The p-value is the most misunderstood number in experimentation. It does NOT tell you the probability that your variation is better. It tells you the probability of seeing data this extreme if the null hypothesis (no difference) were true.
The Correct Interpretation
A p-value of 0.03 means: "If there were truly no difference between control and variation, there's only a 3% chance we'd see results this extreme by random chance." It's evidence against the null hypothesis, not proof that your variation works.
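This definition can be made concrete with a short calculation. The sketch below, using made-up conversion counts, computes a two-sided p-value from a pooled two-proportion z-test, one common way a testing platform arrives at this number.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: control and variation convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # conversion rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(|Z| >= |z|) under the null, via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: 480/10,000 conversions on control, 540/10,000 on variation
p = two_proportion_p_value(480, 10_000, 540, 10_000)
```

The number this returns answers exactly one question: how often would sampling noise alone produce a gap this large between two identical variations?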
Common Misinterpretations
- "p = 0.03 means there's a 97% chance the variation wins" — WRONG
- "p = 0.03 means the variation lifts conversion by 3%" — WRONG
- "p = 0.06 means the test failed" — WRONG (it means the evidence is weaker, not absent)
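A quick way to internalize the correct reading is an A/A simulation: run many tests where both arms are truly identical and see how often p dips below 0.05 by chance alone. The rates, sample sizes, and trial count below are illustrative assumptions.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # degenerate case: no conversions at all
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(7)
RATE, N, TRIALS = 0.05, 2_000, 1_000   # both arms truly convert at 5%
false_positives = 0
for _ in range(TRIALS):
    a = sum(random.random() < RATE for _ in range(N))
    b = sum(random.random() < RATE for _ in range(N))
    if p_value(a, N, b, N) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / TRIALS   # lands near 0.05
```

Even with zero real difference, roughly 5% of tests come back "significant" at the 0.05 threshold. That is the p-value doing exactly what it promises, and nothing more.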
p-values in Practice
Most testing platforms display p-values prominently, which unfortunately encourages binary thinking. A better practice is to focus on confidence intervals and effect sizes, using p-values only as a preliminary filter.
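A confidence interval on the effect size carries more information than the p-value alone. A minimal sketch, again with hypothetical counts, for an approximate 95% interval on the absolute lift:

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for the difference in conversion rates (variation - control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference in proportions
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = lift_ci(480, 10_000, 560, 10_000)
```

An interval like this shows both the plausible size of the lift and the uncertainty around it, which a single p-value hides.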
The 0.05 threshold is a convention, not a natural law. For low-cost, easily reversible changes, a significance threshold of 0.10 might be acceptable. For pricing changes that affect revenue, a stricter threshold such as 0.01 is more appropriate.