You have your test idea. You have buy-in from stakeholders. Now you need to actually set it up without introducing the kind of implementation errors that silently invalidate your results.
Most A/B test failures are not statistical failures. They are setup failures. The hypothesis was vague. The implementation leaked between variations. The QA was nonexistent. The tracking was misconfigured. By the time anyone notices, the test has been running for two weeks on bad data.
This guide walks through every step of test setup, from writing a testable hypothesis, to choosing between client-side and server-side implementation, to the pre-launch QA checklist that catches the mistakes nobody wants to find on day ten.
Writing a Testable Hypothesis
Every valid A/B test starts with a hypothesis. Not a hunch. Not a request from your VP. A structured, testable prediction that data can support or refute.
The format I use is: "If we [specific change], then [specific metric] will [direction and magnitude] because [behavioral mechanism]."
Each element does specific work:
- The change must be specific enough that someone else could implement it without ambiguity. "Improve the pricing page" is not a change. "Replace the feature comparison table with a three-tier card layout" is.
- The metric must be measurable and directly tied to the change. If you change the pricing page layout, your primary metric should be plan selection rate, not blog newsletter signups.
- The direction and magnitude set your minimum detectable effect. "Increase" is not enough. "Increase by at least 5% relative" gives you a number to plug into your sample size calculation.
- The behavioral mechanism is what makes this a hypothesis instead of a guess. It is your theory about why the change will work. When the test concludes, this mechanism is what you are really testing.
Good Hypotheses vs. Bad Hypotheses
Bad: "Let's test a new homepage hero image." This is a task, not a hypothesis. There is no prediction, no metric, and no mechanism.
Better: "Changing the hero image to show the product in use will increase click-through to the features page." This has a change and a metric but no mechanism.
Good: "If we replace the abstract hero image with a screenshot of the dashboard showing real data, then click-through to the features page will increase by at least 8% because visitors currently cannot visualize how the product solves their problem, creating uncertainty that suppresses engagement."
The good hypothesis teaches you something whether it wins or loses. If it loses, you have learned that product visualization is not the barrier to engagement, and you can cross that theory off your list.
Defining Success Metrics
Every test needs a primary metric, guardrail metrics, and secondary metrics. Define all three before you launch.
Primary metric. This is the single number that determines whether the test succeeded. Choose one. Not three. One. Having multiple primary metrics inflates your false positive rate and creates ambiguity about what constitutes a win.
Guardrail metrics. These are metrics that must not degrade. If your pricing page test increases plan selection but decreases average revenue per user, the guardrail catches it. Common guardrails include bounce rate, page load time, and downstream conversion metrics.
Secondary metrics. These provide context and help you understand the mechanism behind the result. They are not decision criteria — they are learning tools.
Calculating Sample Size and Duration
Before you write a single line of code, calculate your required sample size and test duration. This step is non-negotiable. Without it, you have no way of knowing whether your test will produce reliable results.
You need four inputs: your baseline conversion rate, your minimum detectable effect, your significance level (usually 0.05), and your desired statistical power (usually 0.80).
Plug these into any sample size calculator. The output tells you how many visitors per variation you need. Divide by your daily traffic to the tested page, and you have your minimum test duration. Round up to complete weeks to account for day-of-week effects.
If the duration is longer than your organization can tolerate, you have three options: increase the MDE (test bolder changes), increase traffic to the page, or choose a different test. Do not reduce the sample size and hope for the best.
Client-Side vs. Server-Side Implementation
How you implement the test determines what you can test, how fast it loads, and how many things can go wrong.
Client-Side Testing
Client-side testing uses JavaScript to modify the page after it loads in the user's browser. Tools like VWO and Optimizely's web experimentation product use this approach, as did Google Optimize before it was sunset in 2023.
Advantages: No engineering resources needed for simple tests. Marketing teams can create and launch tests without code deployments. Fast iteration.
Disadvantages: Flickering (users see the original briefly before the variant loads). Limited to visual changes. Performance overhead from the testing script. Cannot test backend logic, pricing, or algorithms.
Client-side is best for: headline tests, image swaps, layout changes, CTA button modifications, and other visual experiments on marketing pages.
Server-Side Testing
Server-side testing determines the variation before the page is sent to the browser. The user receives only the content for their assigned variation. Tools like LaunchDarkly, Eppo, and Statsig support this.
Advantages: No flickering. Can test anything — pricing, algorithms, features, backend logic. Better performance. More reliable variation assignment.
Disadvantages: Requires engineering resources for every test. Slower to set up. Higher coordination overhead between product, engineering, and analytics teams.
Server-side is best for: product feature tests, pricing experiments, algorithm changes, onboarding flow modifications, and any test where flickering would compromise the user experience or the test validity.
The Implementation Checklist
Whether you go client-side or server-side, every test implementation needs to get these things right:
- Randomization is truly random. Users must be randomly assigned to variations. Any systematic bias in assignment invalidates the entire test. Verify that your tool uses proper randomization, not something like "even/odd user IDs."
- Assignment is persistent. A user who sees Variant A on their first visit must see Variant A on every subsequent visit during the test. If assignment resets, the same user contaminates both groups.
- Variations are isolated. No cross-contamination between control and variant. If your variant includes a JavaScript change, make sure it only fires for variant users. Leaked CSS or scripts that affect both groups will bias your results toward no difference.
- Tracking fires correctly. Your analytics must correctly attribute each conversion to the right variation. Test this explicitly. Send test traffic, verify that events appear in the correct variation bucket, and check that the numbers add up.
- The control is truly unchanged. The control group should see exactly the current experience. If adding test infrastructure changes the control experience (even a slight performance hit from loading the testing script), you are comparing the variant against a degraded baseline.
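A common way to satisfy both the randomization and persistence requirements at once is deterministic hashing: hash the user ID salted with an experiment ID and bucket the result. The same user always lands in the same bucket with no stored state, and different experiments split users independently. A minimal sketch, with a hypothetical function name rather than any specific tool's API:

```python
import hashlib

def assign_variation(user_id, experiment_id, variants=("control", "variant")):
    """Deterministic 50/50 assignment: same user, same experiment,
    same bucket on every visit, with no server-side state."""
    # Salting with the experiment ID decorrelates assignments across
    # experiments, so one test's buckets do not leak into another's.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 0..9999, approximately uniform
    return variants[0] if bucket < 5_000 else variants[1]
```

Contrast this with "even/odd user IDs": sequential IDs encode signup time, so an even/odd split can systematically separate older users from newer ones, which is exactly the kind of assignment bias that invalidates a test.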
Pre-Launch QA Checklist
Run through this checklist before every test launch. I have seen every one of these items cause a test failure.
- Preview the variant in all target browsers (Chrome, Safari, Firefox, Edge)
- Preview on mobile devices (iOS Safari, Android Chrome at minimum)
- Verify the control looks identical to the current production experience
- Confirm variation assignment persists across sessions (close browser, reopen, verify same variation)
- Check that conversion tracking fires for both control and variant
- Verify no flickering on the variant (for client-side tests)
- Test with ad blockers enabled (they can break client-side testing scripts)
- Confirm the test does not break any critical user flows (registration, checkout, login)
- Verify traffic allocation is set correctly (usually 50/50)
- Document the expected end date based on sample size calculations
The Test Brief
Before launching, create a test brief document. This is part of the overall experimentation process and serves as both a planning tool and a historical record. It should include:
- The hypothesis in full
- Primary metric, guardrail metrics, and secondary metrics
- Required sample size and expected test duration
- Screenshots of control and variant
- Target audience and any exclusions
- Early stopping criteria
- Who is responsible for monitoring and analysis
Common Setup Mistakes
After reviewing hundreds of test setups, these are the mistakes I see most often:
Testing too many changes at once. If your variant changes the headline, the image, the CTA, and the layout, you cannot attribute the result to any specific change. Test one variable per experiment. This is the core principle that separates A/B testing from just redesigning things.
Skipping sample size calculation. Teams launch tests with no idea how long they need to run, then peek at results and call winners prematurely. This is the most common way to produce false positives.
Not documenting the control. The control experience changes over time as other teams ship features. If you do not screenshot and document the exact control at test launch, you lose the ability to replicate or contextualize results later.
Forgetting about mobile. A variant that looks great on desktop but breaks on mobile will produce misleading aggregate results. Always test across devices.
Not planning the analysis in advance. Decide how you will analyze the results before you launch. If you plan to segment by device type or traffic source, make sure you are capturing that data from day one.
Pro Tip: The 24-Hour Sanity Check
Check your test data 24 hours after launch. You are not looking for a winner yet; you are looking for setup problems. Verify that:
- Traffic is roughly evenly split between variations
- Both variations are recording conversions
- Conversion rates are in a plausible range (not zero, not 100%)
- No errors in the browser console related to the test
If anything looks wrong, fix it and restart the test. Do not try to salvage data from a misconfigured test. The 24 hours of lost time is nothing compared to the weeks you would waste running on bad data.
What to Learn Next
This article covers test setup and implementation. Here is where to go next:
- What Is A/B Testing? — review the fundamentals and the five components every valid test requires
- The A/B Testing Process — understand how test setup fits into the end-to-end experimentation workflow
- How Long Should You Run an A/B Test? — calculate the sample size and duration your test requires
- How to Analyze A/B Test Results — plan your analysis approach before launch so you capture the right data