The Most Misunderstood Number in Business
Ask ten people running A/B tests what a p-value means, and you will get ten different answers, most of them wrong. The p-value is arguably the most misunderstood statistical concept in business, and these misunderstandings lead to genuinely bad decisions. Understanding what a p-value actually tells you, and more importantly what it does not tell you, is essential for anyone making data-driven decisions.
What a P-Value Is Not
Let us start by demolishing the most common misconception. A p-value does NOT tell you the probability that your variation B is better than the control A. This is what most people think it means, and this interpretation is fundamentally incorrect.
Here are several other things a p-value does not tell you:
It does not tell you the probability that the null hypothesis is true. It does not tell you the size of the effect. It does not tell you whether the result is practically important. It does not tell you whether you should ship the variation. And it absolutely does not tell you that you can stop the test.
What a P-Value Actually Is
A p-value is the probability of observing a result as extreme as (or more extreme than) your current result, assuming that the null hypothesis is true. In A/B testing terms, the null hypothesis states that there is no difference between your control and variation: the two are functionally identical.
So a p-value of 0.03 means: if there were truly no difference between A and B, there would be a 3% chance of seeing data at least as extreme as what you observed. It is a statement about the data given an assumption, not a statement about the assumption given the data. This distinction is subtle but critical.
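To ground the definition, here is a minimal sketch of how such a p-value is computed for a conversion test: a two-sided, pooled two-proportion z-test using only the Python standard library. The visitor and conversion counts are hypothetical.

```python
# A minimal sketch of the calculation: the two-sided p-value from a pooled
# two-proportion z-test, standard library only. Counts are hypothetical.
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """P(difference at least this extreme | A and B share one true rate)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)        # single rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se          # standardized difference
    # Two-sided tail probability of a standard normal
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Control converts 200/5,000 (4.0%); variation 240/5,000 (4.8%)
p = two_proportion_p_value(200, 5000, 240, 5000)
print(round(p, 3))  # about 0.051
```

Note that the function answers only the conditional question in the definition. Nothing in it speaks to how likely the variation is to actually be better.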
An analogy: imagine you see someone at the grocery store wearing a full astronaut suit. You might think this is unlikely if they are just a regular shopper (your null hypothesis). The p-value would be the probability of seeing someone in an astronaut suit given that they are a regular shopper. A low p-value would lead you to reject the null hypothesis and conclude they are probably not a regular shopper. But the p-value does not directly tell you the probability that they are an astronaut. They could be going to a costume party.
Why P < 0.05 Became the Standard
The 0.05 threshold is a convention, not a law of nature. It traces back to early 20th-century statistical practice, where Ronald Fisher suggested it as a convenient cutoff for flagging results worth a closer look. Over time, through repetition and institutional inertia, it hardened into a rigid standard.
The 0.05 threshold means you are willing to accept a 5% false positive rate: of the tests where no real effect exists, 1 in 20 will appear statistically significant by chance. Whether 5% is the right threshold depends entirely on your context. If a false positive costs your company millions of dollars, you might want 0.01 or even 0.001. If the cost of a false positive is low and the cost of missing a true positive is high, 0.10 might be more appropriate.
The key insight is that the significance level is a decision about how much false positive risk you are willing to tolerate. It should be chosen based on the business context, not reflexively set to 0.05 because that is what everyone else does.
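The 1-in-20 claim is easy to verify by simulation. The sketch below runs a batch of hypothetical A/A tests, where both arms are drawn from the same 5% true conversion rate, so every "significant" result is by construction a false positive.

```python
# Hypothetical A/A simulation: both arms share the same true 5% conversion
# rate, so every "significant" result is, by construction, a false positive.
import random
from math import erf, sqrt

random.seed(0)

def p_value(c_a, n_a, c_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

n, rate, trials = 1000, 0.05, 1000
hits = 0
for _ in range(trials):
    a = sum(random.random() < rate for _ in range(n))
    b = sum(random.random() < rate for _ in range(n))
    if p_value(a, n, b, n) < 0.05:
        hits += 1

fp_rate = hits / trials
print(fp_rate)  # typically lands near 0.05, i.e. about 1 test in 20
```

Raising or lowering the 0.05 cutoff in the last comparison moves this rate accordingly, which is exactly the risk dial the significance level controls.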
What Statistical Significance Actually Means
When someone says a result is "statistically significant," they mean the p-value is below the predetermined significance threshold. If you set alpha at 0.05 and your p-value is 0.03, the result is statistically significant. If your p-value is 0.07, it is not.
But statistical significance is not the same as practical significance. A test could show a statistically significant 0.01% improvement in conversion rate. Statistically, the effect is real. Practically, it is meaningless. The improvement is so small that it would never justify the engineering effort to implement the change.
Conversely, a test might show a 15% improvement with a p-value of 0.08. Statistically, this is not significant at the 0.05 level. But the effect size is large enough that it might warrant further investigation or a follow-up test with more traffic.
Good decision-making requires considering both statistical significance and practical significance together, never one in isolation.
Why Significance Is Not a Stopping Rule
One of the most damaging practices in A/B testing is treating statistical significance as a signal to stop the test. As discussed in the context of test duration, checking for significance repeatedly and stopping when you find it inflates your actual false positive rate far beyond the nominal 5%.
The correct approach is to predetermine your sample size (based on baseline rate, minimum detectable effect, and desired power), run the test until you reach that sample size, and then evaluate significance at that single predetermined endpoint. Significance tells you something useful only when evaluated at the planned analysis point.
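The pre-test calculation described above can be sketched with the standard normal-approximation formula for comparing two conversion rates. The baseline rate, minimum detectable effect, alpha, and power below are hypothetical choices.

```python
# A sketch of the pre-test sample size calculation for comparing two
# conversion rates (normal-approximation formula). Baseline, minimum
# detectable effect, alpha, and power below are hypothetical choices.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 4% baseline conversion, detect an absolute lift of 1 percentage point
n = sample_size_per_arm(0.04, 0.01)
print(n)  # roughly 6,700 visitors per arm
```

Committing to this number before launch, and evaluating significance only once it is reached, is what keeps the nominal false positive rate honest.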
The Base Rate Problem
Here is another subtlety that trips people up. The interpretation of a p-value depends heavily on the prior probability that your hypothesis is true. This is known as the base rate problem.
If you test 100 random changes and only 10 of them actually have a real effect, then among your 90 null tests (no real effect), 5% will show false significance, giving you about 4.5 false positives. Among your 10 real tests, assuming 80% power, 8 will correctly show significance. So of roughly 12.5 significant results, 4.5 are false, meaning about 36% of your significant findings are wrong.
This is why hypothesis quality matters so much. If most of your test ideas are well-reasoned changes backed by user research and behavioral analysis, a higher proportion of your significant results will be genuine. If you are testing random ideas, a large fraction of your significant findings will be false positives even with perfect methodology.
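The base-rate arithmetic above can be spelled out in a few lines, using the example's own assumptions as inputs.

```python
# The base-rate arithmetic spelled out, using the example's assumptions:
# 100 tests, 10% of ideas genuinely work, alpha = 0.05, power = 0.80.
tests, base_rate, alpha, power = 100, 0.10, 0.05, 0.80

real = tests * base_rate      # 10 tests with a genuine effect
null = tests - real           # 90 tests with no effect

false_pos = null * alpha      # 4.5 spurious "winners"
true_pos = real * power       # 8 genuine winners detected

false_discovery_rate = false_pos / (false_pos + true_pos)
print(round(false_discovery_rate, 2))  # 0.36
```

Raising `base_rate` (better hypotheses) or `power` (larger samples) drives this fraction down; neither lever is visible in any individual p-value.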
P-Values and Decision Making: A Framework
Given all of these nuances, how should you actually use p-values in practice? Here is a pragmatic framework:
Use p-values as one input among many. Combine them with effect size, confidence intervals, practical significance, and business context.
Never treat 0.05 as a magic boundary. A p-value of 0.049 is not meaningfully different from 0.051. Do not let an arbitrary threshold make your decision for you.
Consider the cost of being wrong. If implementing a false winner costs very little (you can easily revert), a higher alpha threshold might be acceptable. If it costs a lot (permanent changes, large engineering effort), demand stronger evidence.
Always look at confidence intervals. They tell you the range of plausible effect sizes and are far more informative than a single p-value.
Evaluate at the predetermined endpoint. Not before, not after. At exactly the point you planned.
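The confidence-interval point in the framework deserves a concrete sketch: a 95% Wald interval for the absolute difference in conversion rates, standard library only, with hypothetical counts.

```python
# A 95% Wald confidence interval for the absolute difference in conversion
# rates, standard library only. Counts are hypothetical.
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Interval of plausible values for rate(B) - rate(A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf((1 + level) / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Control: 200/5,000 (4.0%); variation: 240/5,000 (4.8%)
lo, hi = diff_ci(200, 5000, 240, 5000)
```

With these hypothetical counts the interval runs from just below zero to about +1.6 percentage points: the data are consistent both with no effect and with a commercially meaningful lift, which is far more informative than a bare p-value hovering around 0.05.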
The Bigger Picture: Moving Beyond P-Value Fixation
The experimentation community has been gradually moving away from rigid p-value thresholds toward more nuanced approaches. Effect estimation, confidence intervals, and decision-theoretic frameworks that incorporate business costs and benefits all provide richer information than a single yes/no significance test.
The p-value is a useful tool when correctly understood and properly applied. But it is a terrible master. Do not let it make your decisions for you. Use it as one piece of evidence in a broader decision-making process that accounts for effect size, business context, implementation costs, and the quality of your hypothesis.
Key Takeaways
A p-value is the probability of seeing your data given that no real effect exists. It is not the probability that your variation is better. The 0.05 threshold is a convention, not a commandment. Statistical significance is different from practical significance. P-values should be evaluated at predetermined endpoints, never used as continuous stopping rules. And they should always be combined with effect sizes, confidence intervals, and business context for sound decision-making.