The Method — Atticus Li's Experimentation Operating System

Atticus Li

Process & Methodology

An Experimentation Operating System.

Turning growth ideas into evidence-backed investment decisions. PRISM is the engine; the pipeline, prioritization standards, and measurement discipline around it are what make the numbers credible. Built on 280+ experiments and $30M+ in verified impact.

Apply This To Your Business →

Advisory & program builds available

The Engine

PRISM: Five Steps. Every Experiment.

Most experimentation programs test hunches. PRISM applies behavioral economics to find the cognitive mechanism driving (or blocking) conversion — then attaches a revenue forecast before the test runs.

The result: every experiment is accountable to a number, not just a direction.

Probe: Research

Start with data, not guesses. Quantitative analytics show where revenue is leaking. Behavioral analytics — heatmaps, session recordings, funnel drop-off — show how users actually behave. Then behavioral economics assigns the why: is this a trust deficit? Cognitive overload? An anchoring effect working against conversion? Loss aversion at the payment step? Naming the cognitive mechanism is what separates a testable diagnosis from a redesign hunch — and it's where a decade of behavioral economics and psychology training does the heavy lifting.

Revenue Rank: Prioritization

Every finding competes for test bandwidth. Score each hypothesis by one filter: expected revenue impact at current traffic. The test worth the most to the business runs first — not the easiest to build, not the most interesting to the team.
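As a rough illustration of that single filter, the sketch below ranks a backlog by forecast annual revenue at current traffic. The field names, the formula (sessions x baseline conversion x expected relative lift x revenue per conversion), and every number are assumptions made up for the example, not the PRISM scoring model itself.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    monthly_sessions: int          # traffic reaching the tested step today
    baseline_cr: float             # current conversion rate at that step
    expected_relative_lift: float  # forecast lift, e.g. 0.04 for +4%
    revenue_per_conversion: float  # average revenue per converted session


def expected_annual_revenue(h: Hypothesis) -> float:
    """Forecast incremental revenue per year at current traffic."""
    extra_conversions = h.monthly_sessions * h.baseline_cr * h.expected_relative_lift
    return extra_conversions * h.revenue_per_conversion * 12


def revenue_rank(backlog: list[Hypothesis]) -> list[Hypothesis]:
    """Highest expected revenue first."""
    return sorted(backlog, key=expected_annual_revenue, reverse=True)


# Hypothetical backlog: all figures are illustrative.
backlog = [
    Hypothesis("Simplify plan picker", 200_000, 0.05, 0.03, 80.0),
    Hypothesis("Homepage hero rewrite", 500_000, 0.01, 0.02, 80.0),
    Hypothesis("Trust badges at payment", 40_000, 0.30, 0.05, 80.0),
]
for h in revenue_rank(backlog):
    print(f"{h.name}: ${expected_annual_revenue(h):,.0f}")
```

In this made-up backlog the lowest-traffic idea ranks first, because it sits at a high-conversion step where each point of lift is worth more.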

Implement: Hypothesis & Execution

Write a hypothesis that names the behavioral mechanism, the expected change, and the predicted revenue impact. Then build the smallest test that validates it. Run it with proper power calculations, sequential testing, and guardrail metrics that catch unintended side effects.

Score: Analysis

Measure results against the pre-set revenue forecast. Win, Save (prevented a bad rollout), or Learning — every experiment closes with a verdict in revenue terms, not just a percentage lift.

Multiply: Rollout & Compound

Ship the winner. Update the baseline. Feed the insight back into the next hypothesis cycle. Each winning experiment raises the floor for the next one. 100+ tests per year compounding is how $30M+ gets built.

The Operating System

From Idea to Scaled Investment

Once a program scales past a handful of tests, the bottleneck stops being execution and becomes decision quality: which ideas get prioritized, what standards results are held to, and whether winners actually get scaled. This is the pipeline every test moves through — the same operating model that scaled an enterprise program from 20 to 100+ experiments a year.

01

Idea Intake & Problem Framing

Ideas come from customer behavior, analytics anomalies, market opportunities, channel performance, and stakeholder priorities. Every idea gets reframed as a business problem and the decision it will inform — "what should we invest in?" — before it earns a slot in the pipeline.

02

Hypothesis Design

IF / THEN / BECAUSE format, with the BECAUSE naming a behavioral mechanism — loss aversion, choice overload, social proof, anchoring. A hypothesis that can’t name its mechanism usually can’t explain its result either.
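One way to enforce the format is to make the mechanism a required, checkable field. The sketch below is a hypothetical record type, not an existing tool; the class name, field names, and the short mechanism list (taken from the four examples above) are assumptions.

```python
from dataclasses import dataclass

# Mechanisms named in the text; a real program would maintain a longer list.
KNOWN_MECHANISMS = {"loss aversion", "choice overload", "social proof", "anchoring"}


@dataclass(frozen=True)
class Hypothesis:
    if_change: str     # IF we make this change
    then_outcome: str  # THEN this metric moves this way
    because: str       # BECAUSE of this behavioral mechanism

    def names_mechanism(self) -> bool:
        """True only when BECAUSE is a recognized behavioral mechanism."""
        return self.because.strip().lower() in KNOWN_MECHANISMS

    def statement(self) -> str:
        return f"IF {self.if_change}, THEN {self.then_outcome}, BECAUSE of {self.because}."


h = Hypothesis(
    if_change="we cut the plan picker from six options to three",
    then_outcome="plan-selection completion rises",
    because="choice overload",
)
print(h.statement())
print(h.names_mechanism())
```

A hypothesis whose BECAUSE reads "it looks nicer" fails the check and goes back for reframing before it takes a pipeline slot.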

03

Prioritization (RICE)

Reach, Impact, Confidence, Effort — scored transparently so the pipeline is driven by expected business value, not by whoever asks loudest. Teams can see exactly why an idea moved forward or didn’t.
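For readers who have not scored with RICE before, here is a minimal sketch of the commonly used form of the score: Reach x Impact x Confidence / Effort. The scales in the docstring and the two example ideas are assumptions; teams calibrate their own.

```python
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """Reach x Impact x Confidence / Effort.

    reach: users or sessions affected per period
    impact: effect size on a team-agreed scale (assumed here: 0.25 to 3)
    confidence: 0.0 to 1.0
    effort: person-weeks (must be positive)
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


# Hypothetical ideas: the numbers are illustrative only.
ideas = {
    "Niche settings page redesign": rice_score(reach=300, impact=3, confidence=0.8, effort=4),
    "Checkout trust badges": rice_score(reach=40_000, impact=2, confidence=0.8, effort=2),
}
for name, score in sorted(ideas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:,.0f}")
```

The low-reach idea carries the higher impact rating and still lands far down the list, which is the behavior Reach is there to produce.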

04

Experiment Design

Primary metric, guardrail metrics, audience, minimum detectable effect, power and sample-size requirements, duration, and a pre-agreed decision rule. Then match the method to the decision: onsite changes get A/B tests, lifecycle messaging gets holdouts, brand and market-level questions get geo-lift or incrementality designs.
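To make the power and sample-size step concrete, the sketch below uses the standard normal-approximation formula for a two-sided, two-proportion z-test. The function names, the default alpha and power, and the traffic figure are assumptions; a sequential or Bayesian design would size the test differently.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_cr: float, relative_mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per arm to detect the given relative lift."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_arm: int, daily_visitors: int, arms: int = 2) -> int:
    """Days of traffic needed to fill every arm."""
    return math.ceil(n_per_arm * arms / daily_visitors)


# Hypothetical test: 5% baseline conversion, 10% relative MDE, 4,000 visitors a day.
n = sample_size_per_arm(baseline_cr=0.05, relative_mde=0.10)
print(n, duration_days(n, daily_visitors=4_000))
```

Under these assumptions the requirement comes out at about 31,000 visitors per arm. Halving the minimum detectable effect roughly quadruples that, which is why the MDE is agreed before the build starts.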

05

Feasibility & Effort Sizing

Developer effort, design lift, analytics complexity, and — in regulated industries — legal and compliance review. Cheap to kill an infeasible test here; expensive to kill it after three sprints.

06

Analytics Setup & Tracking Audit

Instrument the events, QA the tags, and validate the baseline numbers before anything gets built. Most "surprising" test results are tracking bugs wearing a costume.

07

UX/UI Design & Feedback

Wireframes and variant designs reviewed against the hypothesis — does the design actually manipulate the mechanism we named? Usability feedback loops before development, not after launch.

08

Development & QA

Build the smallest version that tests the hypothesis. Cross-browser, cross-device QA, and a variant-parity check so the test measures the change — not a rendering bug.

09

Launch & Monitoring

Guardrails watched from day one. No reacting to noisy early results unless the test was explicitly designed for sequential analysis. Peeking is how programs manufacture false winners.
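The cost of peeking is easy to see in simulation. The sketch below runs A/A tests with no true effect and compares stopping at the first significant look against a single planned analysis. It is a simplified model: each look is assumed to add one equal-sized batch whose standardized difference is drawn from N(0, 1), and the number of looks, the threshold, and the seed are arbitrary choices.

```python
import math
import random


def false_positive_rates(n_sims: int = 2000, looks: int = 10,
                         z_crit: float = 1.96, seed: int = 7) -> tuple[float, float]:
    """Return (peeking rate, fixed-horizon rate) for simulated A/A tests.

    The cumulative z-score after k batches is the running sum of the
    batch z-scores divided by sqrt(k).
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        total = 0.0
        z = 0.0
        crossed_any = False
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(k)
            if abs(z) > z_crit:
                crossed_any = True  # a peeker would have stopped and shipped here
        peeking_hits += crossed_any
        fixed_hits += abs(z) > z_crit  # only the planned final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking, fixed = false_positive_rates()
print(f"stop at first significant look: {peeking:.1%}")
print(f"single planned analysis:        {fixed:.1%}")
```

With ten unplanned looks, the peeking rate lands well above the nominal 5% while the single planned analysis stays near it. Sequential designs exist to spend that error budget deliberately across looks.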

10

Analysis & Readout

Bayesian and sequential methods, false-discovery-rate control, segment checks for heterogeneous effects. Every test closes with a verdict — win, save, or learning — in revenue terms.
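False-discovery-rate control can be done several ways. The sketch below shows one common procedure, Benjamini-Hochberg, as an assumption about how a batch of readouts might be corrected; the p-values and the FDR level are invented for the example.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each test, whether it counts as a discovery at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is a discovery.
    discoveries = [False] * m
    for idx in order[:cutoff_rank]:
        discoveries[idx] = True
    return discoveries


# Hypothetical quarter: five tests, four of them below p = 0.05.
p_values = [0.003, 0.04, 0.012, 0.20, 0.049]
print(benjamini_hochberg(p_values, q=0.05))
# -> [True, False, True, False, False]
```

Read naively at p < 0.05, this quarter has four winners. After correction, two survive, and the other two close as learnings until they replicate.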

11

Business Review & Stakeholder Comms

Executive readouts built on data storytelling: what we believed, what we found, what it’s worth, what we recommend. Stakeholders who agreed on the decision criteria before the test see no surprises after it.

12

Decision & Handoff

Scale, iterate, stop, or hand off to the owning team — marketing, product, lifecycle, or finance. Handoff is where most experimentation programs quietly lose their value, so the owner of a winning idea is named before the test ever runs.

Prioritization

I've Run Programs on ICE, PIE, and RICE.

RICE won — but not because it's the most sophisticated. Prioritization frameworks sit on a tradeoff between how complex they are and how useful they are. Every hour a team spends scoring ideas is an hour it isn't shipping experiments, and a model elaborate enough to feel "rigorous" is usually a planning tax in disguise.

RICE earns its extra letter: Reach forces the sample-size and feasibility conversation on day one. An idea with beautiful impact scores but no traffic to detect the effect dies in scoring instead of dying six weeks into a doomed test. ICE and PIE let that conversation happen too late.

Just as important: a transparent score defuses loudest-stakeholder bias. Teams can see exactly why an idea moved forward or didn't. But the score structures judgment — it never replaces it.

ICE: Impact · Confidence · Effort

Fast, lightweight. Good for early-stage programs — but nothing stops a zero-traffic idea from scoring high.

PIE: Potential · Importance · Ease

Page-centric and subjective — "potential" and "importance" overlap enough that scores drift toward whoever's in the room.

RICE: Reach · Impact · Confidence · Effort

Reach makes measurement feasibility a first-class input. Slightly more work to score, materially better decisions — the right point on the complexity/usefulness curve.

Measurement

Measuring What a Program Is Actually Worth

To make experimentation legible to the C-suite, every verified win runs through a financial impact model:

EBITDA Impact = Brand Monthly EBITDA × Annualized Traffic × Baseline CR × Relative Lift

It's a good executive-communication tool — and taken alone, it flatters the program. A credible experimentation leader tells the CFO where the model is wrong before the CFO finds out:

Winner's curse

Tests that reach significance tend to overestimate their true lift — you selected them because they looked good. Observed lifts get shrinkage applied before anyone annualizes them.

Lift decay & novelty effects

A 46-day lift is not a 12-month lift. Novelty fades, competitors respond, audiences saturate. Naive annualization is the most common way programs inflate their own impact.

Baseline drift & seasonality

Baseline conversion rates move with seasons, pricing, and market conditions. A static baseline in the formula quietly misattributes market movement to the program.

Revenue is not margin

Revenue impact and contribution-margin impact are different numbers, and the CFO cares about the second one. Where margin data exists, the model should use it.

Interaction effects

Ten concurrent winners rarely sum. Overlapping audiences and compounding changes mean the whole is usually less than the sum of the readouts.
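Three of these corrections are simple enough to sketch in a few lines: shrink the observed lift, let it decay month by month, and report margin rather than revenue. The shrinkage factor, the half-life, the margin rate, and the exponential decay shape are all assumptions picked for illustration, not calibrated values. Baseline drift and interaction effects are not modeled here.

```python
def naive_annual_impact(monthly_revenue: float, observed_lift: float) -> float:
    """Observed test lift held constant for 12 months."""
    return monthly_revenue * observed_lift * 12


def adjusted_annual_impact(monthly_revenue: float, observed_lift: float,
                           shrinkage: float = 0.7,
                           half_life_months: float = 6.0,
                           margin_rate: float = 0.4) -> float:
    """Shrink the lift, decay it each month, and convert revenue to margin.

    shrinkage: share of the observed lift assumed to be real (winner's curse)
    half_life_months: months for the remaining lift to halve (novelty, saturation)
    margin_rate: contribution margin per revenue dollar
    """
    total = 0.0
    for month in range(12):
        decayed_lift = observed_lift * shrinkage * 0.5 ** (month / half_life_months)
        total += monthly_revenue * decayed_lift * margin_rate
    return total


# Hypothetical winner: $1M monthly revenue through the tested flow, +5% observed lift.
naive = naive_annual_impact(1_000_000, 0.05)
adjusted = adjusted_annual_impact(1_000_000, 0.05)
print(f"naive revenue readout:   ${naive:,.0f}")
print(f"adjusted margin readout: ${adjusted:,.0f}")
```

The gap between the two printed numbers is the conversation to have with the CFO before the readout, not after.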

The Correction

Long-Term Holdouts: The Program-Level Audit

The strongest correction for all five failure modes is the same one platforms like Eppo and Statsig have productized: long-term holdouts. Keep a small slice of the audience on the old experience for a quarter or more, ship winners to everyone else, and measure the cumulative gap. That single number captures lift decay, novelty effects, and interaction between concurrent winners — the true, durable value of the program rather than the sum of its optimistic readouts.

Per-test models make the program legible quarter to quarter. Holdouts keep it honest year to year. A mature program runs both.
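The program-level audit comes down to one comparison: what the per-test readouts claim against what the holdout measures. A minimal sketch follows, with hypothetical function names and invented counts. A real readout would also put an interval around the gap, which this omits.

```python
def relative_gap(holdout_conversions: int, holdout_visitors: int,
                 shipped_conversions: int, shipped_visitors: int) -> float:
    """Cumulative relative lift of the shipped experience over the holdout."""
    holdout_cr = holdout_conversions / holdout_visitors
    shipped_cr = shipped_conversions / shipped_visitors
    return shipped_cr / holdout_cr - 1


def stacked_readouts(per_test_lifts: list[float]) -> float:
    """What the per-test readouts imply if every lift compounded in full."""
    combined = 1.0
    for lift in per_test_lifts:
        combined *= 1 + lift
    return combined - 1


# Hypothetical quarter: four shipped winners and a 5% holdout.
claimed = stacked_readouts([0.04, 0.03, 0.05, 0.02])
measured = relative_gap(holdout_conversions=2_500, holdout_visitors=50_000,
                        shipped_conversions=51_300, shipped_visitors=950_000)
print(f"stacked readouts: {claimed:.1%}")
print(f"holdout gap:      {measured:.1%}")
print(f"realization:      {measured / claimed:.0%}")
```

The realization ratio is the number that keeps the per-test model honest: it shows how much of the claimed lift survived decay and overlap.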

The Caveat

There Is No One-Size-Fits-All

Everything above is a starting architecture, not a template. The right version of this system depends on team size, statistical maturity, tooling, company politics, the OKRs and KPIs each team is actually measured on, bandwidth, and who owns what. A five-person growth team and a five-brand enterprise need very different amounts of process — and installing more governance than a team can absorb kills velocity just as surely as having none.

This is also the most common failure mode of hiring it out: agencies that install the same templated process at every client, regardless of how the team actually works. The mismatch shows up as friction — between teams, and inside them — and the program gets blamed for what was really a fit problem. The operating system has to serve the team. Never the other way around.

100+

Experiments / Year

$30M+

Verified Revenue Impact (2025)

24%+

A/B Test Win Rate (2025)

Each experiment feeds the next. Behavioral insights don't expire — they compound into a permanent intelligence advantage over competitors who are still guessing.

Next Step

Two Ways to Use This System

For Companies

Building a program?

Audits, 90-day sprints, and advisory retainers — I build the experimentation operating system with your team, shaped to how your team actually works.

Work with me →

For Hiring Teams

Hiring for experimentation or growth?

I lead growth experimentation functions — the operating model, the measurement standards, and the executive decision layer. See the case studies under Work, or reach out directly.

Get in touch →

You Don't Need to Learn PRISM. You Need Someone to Run It.

The framework works. I've run it 500+ times across Fortune 150 companies and startups. The question is whether you want to build this capability internally or get results now.

See Pricing Below ↓

Investment

The average company leaves $2M+ in unrealized revenue from unoptimized funnels. One experiment that moves your conversion rate 0.5% could recover $500K. The question isn't cost — it's what it costs you to keep guessing.

Strategic Direction

Advisory

Starting at $5,000/mo

For teams that have the execution capacity but need the behavioral science expertise and experimentation strategy.

Monthly strategy sessions
Experimentation roadmap & prioritization
Behavioral audit of your funnel
Hypothesis frameworks for your team
Async Slack/email support
Month-to-month after first 90 days

Get Started →

Full Experimentation Program (Most Popular)

Done-for-You

Starting at $15,000/mo

I embed with your team and run the entire experimentation program. You get the strategist, the executor, and the system — not a junior analyst.

Everything in Advisory
Full experiment design, build & QA
Revenue forecasting per experiment
Executive reporting & stakeholder alignment
10–15 experiments per quarter
Direct access — no account managers

Get Started →

Half the cost of a top-tier agency. The person with $30M+ in results does the work — not a junior they hired last month.

Start a Conversation

Ready to Stop Guessing?

Tell me about your growth challenge. No pitch decks, no sales reps — just a direct conversation about whether I can help.

Ready now?

Book a 30-minute discovery call →

Lean Experiments Newsletter

Revenue Frameworks for Growth Leaders

Every week: one experiment, one framework, one insight to make your marketing more evidence-based and your revenue more predictable.


Browse issues

Read the archive

200+ issues of experiments, frameworks, and field reports from inside a Fortune 150 growth team.

Open Substack