The Lab vs the Real World: Why Online Experiments Are Harder Than They Look
In a laboratory experiment, researchers control everything. The temperature is constant. The lighting is uniform. The subjects are selected and assigned carefully. The experimental protocol is followed precisely. The environment is, by design, stable and predictable.
Online experiments enjoy none of these luxuries. Your test runs on a live website where traffic sources shift hourly, user intent varies by day, marketing campaigns launch without warning, competitors change their pricing, and the very composition of your audience fluctuates constantly. This non-stationarity is the fundamental challenge of online experimentation, and it creates external validity threats that can make your test results unreliable.
External validity refers to the extent to which your results generalize beyond the specific conditions of the test. If your test shows a 15% lift but only because it ran during a unique set of circumstances, implementing that change permanently may not produce the expected improvement.
Why Website Data Is Non-Stationary
Stationarity means that the statistical properties of your data do not change over time. Website data is almost never stationary. Your conversion rate on a Tuesday in March is generated by a fundamentally different process than your conversion rate on a Saturday in December. The people are different, their intentions are different, and their behavior patterns are different.
This matters because most A/B testing statistics assume that your data is drawn from a single, stable population. When the population shifts during your test, your statistical analysis may be valid mathematically but meaningless practically. You are comparing averages across a mixture of populations, and the result may not apply to any of them individually.
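To make this concrete, here is a hypothetical, hand-picked set of counts showing how pooling two different populations can reverse a comparison (a form of Simpson's paradox). The segment names and numbers are illustrative only:

```python
# Hypothetical conversion counts: (conversions, visitors) per segment and arm.
segments = {
    "weekday": {"A": (100, 1000), "B": (56, 500)},
    "weekend": {"A": (10, 500), "B": (30, 1000)},
}

def rate(conversions, visitors):
    return conversions / visitors

# Within each segment, B converts better than A...
for arms in segments.values():
    assert rate(*arms["B"]) > rate(*arms["A"])

# ...yet pooled over the mixed population, A appears to win,
# because the arms received different mixes of the two segments.
pooled = {
    arm: tuple(sum(seg[arm][i] for seg in segments.values()) for i in (0, 1))
    for arm in ("A", "B")
}
print(f"pooled A: {rate(*pooled['A']):.3f}, pooled B: {rate(*pooled['B']):.3f}")
```

The pooled comparison here is mathematically valid but describes neither the weekday nor the weekend population.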
Sources of External Validity Threats
Seasonal and Cyclical Variations
User behavior shifts dramatically across seasons. E-commerce sites see fundamentally different behavior during holiday shopping seasons. B2B SaaS companies see budget-driven patterns around quarter ends and fiscal year boundaries. Travel sites see seasonal demand patterns. If your test captures only one phase of a seasonal cycle, your results may not hold during other phases.
Day-of-Week Effects
Monday visitors differ from Friday visitors, who differ from Sunday visitors. They arrive through different channels, have different levels of urgency, and display different browsing patterns. A test that runs only on weekdays systematically excludes weekend behavior. This is why running tests for at least two full weeks is critical: it ensures you capture the complete weekly cycle twice.
Press Mentions and Viral Events
A press mention, a viral social media post, or a celebrity endorsement can temporarily flood your site with visitors who are categorically different from your normal audience. These visitors often have lower purchase intent and different demographic profiles. If such an event happens during your test, it can either inflate or deflate your measured effect depending on how the new visitors interact with your variation.
Marketing Campaigns and PPC Changes
Paid campaigns directly alter your traffic composition. A new search campaign brings visitors with different keywords and different intent. An email campaign brings existing customers rather than new prospects. A display retargeting campaign brings warm leads who have already visited. Any of these changes during a test can shift your results in ways that do not represent steady-state performance.
SEO and Organic Traffic Shifts
Search engine algorithm updates can shift your organic traffic composition overnight. New ranking positions for different keywords bring visitors with different search intent. A competitor ranking change can alter the comparison context that visitors use when evaluating your site. These shifts are largely invisible and can quietly compromise your test results.
Word of Mouth and Referral Patterns
If your product gets discussed in a relevant online community during your test, the influx of referred visitors brings a different type of audience. These visitors have been pre-sold on certain aspects of your product and may react differently to your test variation than unprimed visitors would.
Sample Pollution: When Your Test Groups Are Not Clean
Sample pollution occurs when visitors assigned to one variation are somehow exposed to or influenced by the other variation. This can happen in several ways:
Cross-device contamination: a user sees the control on their laptop and the variation on their phone.

Bot traffic: automated crawlers assigned to both groups, diluting your human signal.

Shared computers: one user converts under the control bucket while a family member on the same device is bucketed into the variation.

Each of these compromises the clean separation between test groups that randomized experimentation requires.
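Bot pollution in particular can often be reduced with a crude user-agent filter before analysis. This is only a heuristic sketch (the pattern and the log-record shape are assumptions); dedicated bot-detection tooling catches far more:

```python
import re

# Assumed log shape: each visit record has a user-agent string and a variant.
BOT_PATTERN = re.compile(r"bot|crawler|spider|scraper", re.IGNORECASE)

def is_probable_bot(user_agent: str) -> bool:
    """Crude heuristic: flag user agents containing common crawler keywords."""
    return bool(BOT_PATTERN.search(user_agent))

visits = [
    {"ua": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "variant": "A"},
    {"ua": "Googlebot/2.1 (+http://www.google.com/bot.html)", "variant": "B"},
    {"ua": "AhrefsBot/7.0; +http://ahrefs.com/robot/", "variant": "A"},
]
human_visits = [v for v in visits if not is_probable_bot(v["ua"])]
```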
The Flicker Effect: A Subtle Technical Threat
The flicker effect occurs when visitors briefly see the original version of a page before the test variation loads. This happens when the testing code executes after the page has already started rendering. The visitor sees the original content flash briefly, then watches it change to the variation.
This is an external validity threat because it introduces a confound. Visitors who see the flicker have a different experience than visitors who do not. Some may be confused or distrustful, affecting their behavior in ways that have nothing to do with the variation itself. The measured effect becomes a mixture of the actual variation effect and the flicker effect, making it impossible to attribute results cleanly.
Mitigating the flicker effect requires implementing tests server-side or using anti-flicker techniques that hide the page until the variation is fully loaded. Either approach adds technical complexity but is necessary for clean results.
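Server-side testing typically means deciding the variant before the page is rendered. A minimal sketch of deterministic hash-based bucketing (the function and variant names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user so the server can render the chosen
    variant directly into the HTML: the visitor never sees the original flash."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Stable: the same user always lands in the same bucket for a given experiment.
assert assign_variant("user-42", "checkout-cta") == assign_variant("user-42", "checkout-cta")
```

As a side benefit, keying the hash on a logged-in account ID rather than a per-browser cookie also reduces cross-device contamination.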
Revenue Tracking Errors
When your primary metric involves revenue, tracking errors become a serious threat. A single high-value outlier transaction can dramatically skew results. If one variation happens to contain the visitor who makes a $50,000 purchase, that variation will appear to massively outperform the other, even if the difference is entirely due to this one transaction.
Revenue data is also susceptible to refunds, chargebacks, and delayed transactions that may not be captured during the test window. A variation that appears to generate more revenue during the test might look different once returns and cancellations are factored in.
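A tiny hypothetical example of the outlier problem, with two arms carrying identical order values except for one large purchase:

```python
# 1,000 identical orders per arm, plus a single $50,000 outlier in B.
base_orders = [40, 55, 60, 35, 50] * 200
variant_a = base_orders
variant_b = base_orders + [50_000]

mean_a = sum(variant_a) / len(variant_a)  # 48.0
mean_b = sum(variant_b) / len(variant_b)  # ~97.9: one order roughly doubles the mean
```

With identical underlying behavior, B's average order value appears twice as high purely because of where one transaction landed.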
Selection Bias and Self-Selection
Selection bias occurs when the sample of users in your test is not representative of your overall population. This can happen through technical issues with traffic allocation, such as excluding certain browsers or devices. It can happen through geographic biases if your CDN serves variations differently in different regions. And it can happen through temporal biases if your test starts at a particular time that systematically excludes certain user segments.
Self-selection is a related problem where the act of participating in the test is correlated with the outcome. For example, if your variation takes slightly longer to load and impatient users bounce before being counted, your variation group is biased toward more patient users, who may also be more likely to convert.
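One common diagnostic for this family of problems is the sample ratio mismatch (SRM) check: if the split was intended to be 50/50 but the observed group sizes differ by more than chance allows, something is biasing assignment or counting. A stdlib-only sketch with made-up counts:

```python
import math

def srm_pvalue(n_control: int, n_variant: int) -> float:
    """Chi-square test (1 degree of freedom) that observed group sizes
    are consistent with an intended 50/50 split."""
    expected = (n_control + n_variant) / 2
    chi2 = ((n_control - expected) ** 2 + (n_variant - expected) ** 2) / expected
    # With 1 degree of freedom, the chi-square survival function reduces to erfc.
    return math.erfc(math.sqrt(chi2 / 2))

p = srm_pvalue(50_400, 49_100)  # illustrative counts
if p < 0.001:
    print("Sample ratio mismatch: investigate assignment and tracking.")
```

A very strict threshold (such as 0.001) is typically used here, because an SRM indicates a broken experiment rather than an effect to be estimated.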
How to Mitigate External Validity Threats
Run tests for full business cycles. Capture the full range of traffic patterns, including all days of the week and ideally multiple weeks.
Document external events. Keep a log of marketing campaigns, press coverage, product changes, and other events that coincide with your test. Use this log to contextualize results.
Segment your results. Look at results broken down by traffic source, device, day of week, and other dimensions. If the effect only exists in one segment, it may not generalize.
Use holdback tests. After implementing a winner, keep a small holdback group on the original version to verify that the effect persists in production.
Implement tests server-side when possible. This eliminates the flicker effect and provides cleaner assignment.
Cap outliers in revenue analysis. Winsorize or cap extreme values to prevent individual transactions from dominating results.
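Capping can be as simple as clipping at a high percentile. A one-sided winsorization sketch using a nearest-rank percentile (the 99th-percentile cutoff is a judgment call; 99.9 is also common):

```python
def cap_outliers(values, upper_pct=99.0):
    """Clip values above the given percentile (nearest-rank) so a single
    huge transaction cannot dominate an arm's mean."""
    ordered = sorted(values)
    cap = ordered[int(len(ordered) * upper_pct / 100)]
    return [min(v, cap) for v in values]

revenues = [40, 55, 60, 35, 50] * 200 + [50_000]
capped = cap_outliers(revenues)
# The $50,000 outlier is pulled down to the 99th-percentile value,
# so the arm's mean again reflects typical orders.
```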
Key Takeaways
Online experiments face far more external validity threats than controlled laboratory studies. Website data is non-stationary, with traffic composition shifting due to seasons, campaigns, press events, and countless other factors. Sample pollution, flicker effects, revenue tracking errors, and selection bias can all compromise results. Mitigating these threats requires running tests for full business cycles, documenting external events, segmenting results, and implementing technical best practices. Always ask not just whether your test is statistically significant but whether the conditions of the test are representative of the conditions under which the change will be deployed permanently.