You have your test idea. You have buy-in from stakeholders. Now you need to actually set it up without introducing the kind of implementation errors that silently invalidate your results.
Most A/B test failures are not statistical failures. They are setup failures. The hypothesis was vague. The implementation leaked between variations. The QA was nonexistent. The tracking was misconfigured. By the time anyone notices, the test has been running for two weeks on bad data.
This guide walks through every step of test setup, from writing a testable hypothesis, to choosing between client-side and server-side implementation, to the pre-launch QA checklist that catches the mistakes nobody wants to find on day ten.
Writing a Testable Hypothesis
Every valid A/B test starts with a hypothesis. Not a hunch. Not a request from your VP. A structured, testable prediction that data can support or refute.
The format I use is: "If we [specific change], then [specific metric] will [direction and magnitude] because [behavioral mechanism]."
Each element does specific work:
- The change must be specific enough that someone else could implement it without ambiguity. "Improve the pricing page" is not a change. "Replace the feature comparison table with a three-tier card layout" is.
- The metric must be measurable and directly tied to the change. If you change the pricing page layout, your primary metric should be plan selection rate, not blog newsletter signups.
- The direction and magnitude set your minimum detectable effect. "Increase" is not enough. "Increase by at least 5% relative" gives you a number to plug into your sample size calculation.
- The behavioral mechanism is what makes this a hypothesis instead of a guess. It is your theory about why the change will work. When the test concludes, this mechanism is what you are really testing.
Good Hypotheses vs. Bad Hypotheses
Bad: "Let's test a new homepage hero image." This is a task, not a hypothesis. There is no prediction, no metric, and no mechanism.
Better: "Changing the hero image to show the product in use will increase click-through to the features page." This has a change and a metric but no mechanism.
Good: "If we replace the abstract hero image with a screenshot of the dashboard showing real data, then click-through to the features page will increase by at least 8% because visitors currently cannot visualize how the product solves their problem, creating uncertainty that suppresses engagement."
The good hypothesis teaches you something whether it wins or loses. If it loses, you have learned that product visualization is not the barrier to engagement, and you can cross that theory off your list.
Defining Success Metrics
Every test needs a primary metric, guardrail metrics, and secondary metrics. Define all three before you launch.
Primary metric. This is the single number that determines whether the test succeeded. Choose one. Not three. One. Having multiple primary metrics inflates your false positive rate and creates ambiguity about what constitutes a win.
Guardrail metrics. These are metrics that must not degrade. If your pricing page test increases plan selection but decreases average revenue per user, the guardrail catches it. Common guardrails include bounce rate, page load time, and downstream conversion metrics.
Secondary metrics. These provide context and help you understand the mechanism behind the result. They are not decision criteria — they are learning tools.
Calculating Sample Size and Duration
Before you write a single line of code, calculate your required sample size and test duration. This step is non-negotiable. Without it, you have no way of knowing whether your test will produce reliable results.
You need four inputs: your baseline conversion rate, your minimum detectable effect, your significance level (usually 0.05), and your desired statistical power (usually 0.80).
Plug these into any sample size calculator. The output tells you how many visitors per variation you need. Divide by your daily traffic to the tested page, and you have your minimum test duration. Round up to complete weeks to account for day-of-week effects.
If the duration is longer than your organization can tolerate, you have three options: increase the MDE (test bolder changes), increase traffic to the page, or choose a different test. Do not reduce the sample size and hope for the best.
Client-Side vs. Server-Side Implementation
How you implement the test determines what you can test, how fast it loads, and how many things can go wrong.
Client-Side Testing
Client-side testing uses JavaScript to modify the page after it loads in the user's browser. Tools like VWO and Optimizely's web experimentation product use this approach, as did Google Optimize before it was sunset in 2023.
Advantages: No engineering resources needed for simple tests. Marketing teams can create and launch tests without code deployments. Fast iteration.
Disadvantages: Flickering (users see the original briefly before the variant loads). Limited to visual changes. Performance overhead from the testing script. Cannot test backend logic, pricing, or algorithms.
Client-side is best for: headline tests, image swaps, layout changes, CTA button modifications, and other visual experiments on marketing pages.
Server-Side Testing
Server-side testing determines the variation before the page is sent to the browser. The user receives only the content for their assigned variation. Tools like LaunchDarkly, Eppo, and Statsig support this.
Advantages: No flickering. Can test anything — pricing, algorithms, features, backend logic. Better performance. More reliable variation assignment.
Disadvantages: Requires engineering resources for every test. Slower to set up. Higher coordination overhead between product, engineering, and analytics teams.
Server-side is best for: product feature tests, pricing experiments, algorithm changes, onboarding flow modifications, and any test where flickering would compromise the user experience or the test validity.
The Implementation Checklist
Whether you go client-side or server-side, every test implementation needs to get these things right:
- Randomization is truly random. Users must be randomly assigned to variations. Any systematic bias in assignment invalidates the entire test. Verify that your tool uses proper randomization, not something like "even/odd user IDs."
- Assignment is persistent. A user who sees Variant A on their first visit must see Variant A on every subsequent visit during the test. If assignment resets, the same user contaminates both groups.
- Variations are isolated. No cross-contamination between control and variant. If your variant includes a JavaScript change, make sure it only fires for variant users. Leaked CSS or scripts that affect both groups will bias your results toward no difference.
- Tracking fires correctly. Your analytics must correctly attribute each conversion to the right variation. Test this explicitly. Send test traffic, verify that events appear in the correct variation bucket, and check that the numbers add up.
- The control is truly unchanged. The control group should see exactly the current experience. If adding test infrastructure changes the control experience (even a slight performance hit from loading the testing script), you are comparing the variant against a degraded baseline.
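A common way to satisfy both the randomization and persistence requirements at once is deterministic hashing: hash the user ID salted with an experiment ID and bucket the result. The same user always lands in the same bucket with no stored state, and different experiments split users independently. A minimal sketch, with a hypothetical function name rather than any specific tool's API:

```python
import hashlib

def assign_variation(user_id, experiment_id, variants=("control", "variant")):
    """Deterministic 50/50 assignment: same user, same experiment,
    same bucket on every visit, with no server-side state."""
    # Salting with the experiment ID decorrelates assignments across
    # experiments, so one test's buckets do not leak into another's.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 0..9999, approximately uniform
    return variants[0] if bucket < 5_000 else variants[1]
```

Contrast this with "even/odd user IDs": sequential IDs encode signup time, so an even/odd split can systematically separate older users from newer ones, which is exactly the kind of assignment bias that invalidates a test.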
Pre-Launch QA Checklist
Run through this checklist before every test launch. I have seen every one of these items cause a test failure.
- Preview the variant in all target browsers (Chrome, Safari, Firefox, Edge)
- Preview on mobile devices (iOS Safari, Android Chrome at minimum)
- Verify the control looks identical to the current production experience
- Confirm variation assignment persists across sessions (close browser, reopen, verify same variation)
- Check that conversion tracking fires for both control and variant
- Verify no flickering on the variant (for client-side tests)
- Test with ad blockers enabled (they can break client-side testing scripts)
- Confirm the test does not break any critical user flows (registration, checkout, login)
- Verify traffic allocation is set correctly (usually 50/50)
- Document the expected end date based on sample size calculations
The Test Brief
Before launching, create a test brief document. This is part of the overall experimentation process and serves as both a planning tool and a historical record. It should include:
- The hypothesis in full
- Primary metric, guardrail metrics, and secondary metrics
- Required sample size and expected test duration
- Screenshots of control and variant
- Target audience and any exclusions
- Early stopping criteria
- Who is responsible for monitoring and analysis
Common Setup Mistakes
After reviewing hundreds of test setups, these are the mistakes I see most often:
Testing too many changes at once. If your variant changes the headline, the image, the CTA, and the layout, you cannot attribute the result to any specific change. Test one variable per experiment. This is the core principle that separates A/B testing from just redesigning things.
Skipping sample size calculation. Teams launch tests with no idea how long they need to run, then peek at results and call winners prematurely. This is the most common way to produce false positives.
Not documenting the control. The control experience changes over time as other teams ship features. If you do not screenshot and document the exact control at test launch, you lose the ability to replicate or contextualize results later.
Forgetting about mobile. A variant that looks great on desktop but breaks on mobile will produce misleading aggregate results. Always test across devices.
Not planning the analysis in advance. Decide how you will analyze the results before you launch. If you plan to segment by device type or traffic source, make sure you are capturing that data from day one.
Pro Tip: The 24-Hour Sanity Check
Check your test data 24 hours after launch. You are not looking for a winner yet; you are looking for setup problems. Verify that:
- Traffic is roughly evenly split between variations
- Both variations are recording conversions
- Conversion rates are in a plausible range (not zero, not 100%)
- No errors in the browser console related to the test
If anything looks wrong, fix it and restart the test. Do not try to salvage data from a misconfigured test. The 24 hours of lost time is nothing compared to the weeks you would waste running on bad data.
What to Learn Next
This article covers test setup and implementation. Here is where to go next:
- What Is A/B Testing? — review the fundamentals and the five components every valid test requires
- The A/B Testing Process — understand how test setup fits into the end-to-end experimentation workflow
- How Long Should You Run an A/B Test? — calculate the sample size and duration your test requires
- How to Analyze A/B Test Results — plan your analysis approach before launch so you capture the right data