An Experimentation Operating System.
Turning growth ideas into evidence-backed investment decisions. PRISM is the engine; the pipeline, prioritization standards, and measurement discipline around it are what make the numbers credible. Built on 280+ experiments and $30M+ in verified impact.
Advisory & program builds available
PRISM: Five Steps. Every Experiment.
Most experimentation programs test hunches. PRISM applies behavioral economics to find the cognitive mechanism driving (or blocking) conversion — then attaches a revenue forecast before the test runs.
The result: every experiment is accountable to a number, not just a direction.
Start with data, not guesses. Quantitative analytics show where revenue is leaking. Behavioral analytics — heatmaps, session recordings, funnel drop-off — show how users actually behave. Then behavioral economics assigns the why: is this a trust deficit? Cognitive overload? An anchoring effect working against conversion? Loss aversion at the payment step? Naming the cognitive mechanism is what separates a testable diagnosis from a redesign hunch — and it's where a decade of behavioral economics and psychology training does the heavy lifting.
Every finding competes for test bandwidth. Score each hypothesis by one filter: expected revenue impact at current traffic. The test worth the most to the business runs first — not the easiest to build, not the most interesting to the team.
Write a hypothesis that names the behavioral mechanism, the expected change, and the predicted revenue impact. Then build the smallest test that validates it, with proper power calculations, sequential testing, and guardrail metrics to catch unintended side effects.
Measure results against the pre-set revenue forecast. Win, Save (prevented a bad rollout), or Learning — every experiment closes with a verdict in revenue terms, not just a percentage lift.
Ship the winner. Update the baseline. Feed the insight back into the next hypothesis cycle. Each winning experiment raises the floor for the next one. 100+ tests per year compounding is how $30M+ gets built.
From Idea to Scaled Investment
Once a program scales past a handful of tests, the bottleneck stops being execution and becomes decision quality: which ideas get prioritized, what standards results are held to, and whether winners actually get scaled. This is the pipeline every test moves through — the same operating model that scaled an enterprise program from 20 to 100+ experiments a year.
Idea Intake & Problem Framing
Ideas come from customer behavior, analytics anomalies, market opportunities, channel performance, and stakeholder priorities. Every idea gets reframed as a business problem and the decision it will inform — "what should we invest in?" — before it earns a slot in the pipeline.
Hypothesis Design
IF / THEN / BECAUSE format, with the BECAUSE naming a behavioral mechanism — loss aversion, choice overload, social proof, anchoring. A hypothesis that can’t name its mechanism usually can’t explain its result either.
Prioritization (RICE)
Reach, Impact, Confidence, Effort — scored transparently so the pipeline is driven by expected business value, not by whoever asks loudest. Teams can see exactly why an idea moved forward or didn’t.
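A minimal sketch of transparent RICE scoring, assuming the conventional formula (Reach × Impact × Confidence ÷ Effort). The units, the `Idea` fields, and the three example ideas are hypothetical, chosen only to show how a low-reach idea drops down the ranking even with a high impact score:

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    reach: float       # users exposed per quarter (assumed unit)
    impact: float      # relative impact score, e.g. 0.25 to 3 (assumed scale)
    confidence: float  # 0.0 to 1.0
    effort: float      # person-weeks (assumed unit)


def rice_score(idea: Idea) -> float:
    """Conventional RICE: reach * impact * confidence / effort."""
    if idea.effort <= 0:
        raise ValueError("effort must be positive")
    return idea.reach * idea.impact * idea.confidence / idea.effort


def rank(ideas):
    """Highest score first, so everyone can see why an idea moved forward."""
    return sorted(ideas, key=rice_score, reverse=True)


# Hypothetical backlog, for illustration only.
ideas = [
    Idea("settings page redesign", reach=500, impact=3.0, confidence=0.9, effort=4),
    Idea("checkout trust badges", reach=40_000, impact=1.0, confidence=0.8, effort=2),
    Idea("pricing page anchor", reach=12_000, impact=2.0, confidence=0.5, effort=3),
]

for idea in rank(ideas):
    print(f"{idea.name}: {rice_score(idea):,.1f}")
```

The redesign carries the highest impact score and still ranks last, because almost nobody would see it. That is the Reach conversation happening in scoring rather than six weeks into a test.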
Experiment Design
Primary metric, guardrail metrics, audience, minimum detectable effect, power and sample-size requirements, duration, and a pre-agreed decision rule. Then match the method to the decision: onsite changes get A/B tests, lifecycle messaging gets holdouts, brand and market-level questions get geo-lift or incrementality designs.
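As a rough illustration of the power and sample-size step, here is a sketch for a fixed-horizon, two-sided test on a conversion rate, using the standard normal-approximation formula for two proportions. The function names, the 5% baseline, and the traffic figure are assumptions for the example; a sequential or Bayesian design would need a different calculation:

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect a relative lift of `relative_mde`."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def weeks_to_run(n_per_arm, weekly_visitors, arms=2):
    """Duration implied by the sample size at current traffic."""
    return math.ceil(n_per_arm * arms / weekly_visitors)


# Hypothetical inputs: 5% baseline conversion, 10% relative MDE, 20,000 visitors/week.
n = sample_size_per_arm(baseline=0.05, relative_mde=0.10)
print(n, "visitors per arm,", weeks_to_run(n, weekly_visitors=20_000), "weeks")
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why the MDE and the duration have to be agreed before the build, not after.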
Feasibility & Effort Sizing
Developer effort, design lift, analytics complexity, and — in regulated industries — legal and compliance review. Cheap to kill an infeasible test here; expensive to kill it after three sprints.
Analytics Setup & Tracking Audit
Instrument the events, QA the tags, validate the baseline numbers before anything gets built. Most "surprising" test results are tracking bugs wearing a costume.
UX/UI Design & Feedback
Wireframes and variant designs reviewed against the hypothesis — does the design actually manipulate the mechanism we named? Usability feedback loops before development, not after launch.
Development & QA
Build the smallest version that tests the hypothesis. Cross-browser, cross-device QA, and a variant-parity check so the test measures the change — not a rendering bug.
Launch & Monitoring
Guardrails watched from day one. No reacting to noisy early results unless the test was explicitly designed for sequential analysis. Peeking is how programs manufacture false winners.
Analysis & Readout
Bayesian and sequential methods, false-discovery-rate control, segment checks for heterogeneous effects. Every test closes with a verdict — win, save, or learning — in revenue terms.
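To make the false-discovery-rate step concrete, here is a small sketch of the Benjamini-Hochberg procedure, one common FDR method (the text above does not say which one the program uses). The p-values are invented for illustration:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where a test survives FDR control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is kept.
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[idx] = True
    return keep


# Hypothetical p-values from six tests read out in the same quarter.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
survivors = benjamini_hochberg(p_values, q=0.05)
naive_winners = [p < 0.05 for p in p_values]
print(sum(naive_winners), "naive winners,", sum(survivors), "after FDR control")
```

In this made-up batch, four tests clear an uncorrected 0.05 threshold and only two survive the correction. The two that drop out are exactly the borderline results a program is most tempted to annualize.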
Business Review & Stakeholder Comms
Executive readouts built on data storytelling: what we believed, what we found, what it’s worth, what we recommend. Stakeholders who agree on the decision criteria before the test see no surprises after it.
Decision & Handoff
Scale, iterate, stop, or hand off to the owning team — marketing, product, lifecycle, or finance. Handoff is where most experimentation programs quietly lose their value, so the owner of a winning idea is named before the test ever runs.
I've Run Programs on ICE, PIE, and RICE.
RICE won — but not because it's the most sophisticated. Prioritization frameworks sit on a tradeoff between how complex they are and how useful they are. Every hour a team spends scoring ideas is an hour it isn't shipping experiments, and a model elaborate enough to feel "rigorous" is usually a planning tax in disguise.
RICE earns its extra letter: Reach forces the sample-size and feasibility conversation on day one. An idea with beautiful impact scores but no traffic to detect the effect dies in scoring instead of dying six weeks into a doomed test. ICE and PIE let that conversation happen too late.
Just as important: a transparent score defuses loudest-stakeholder bias. Teams can see exactly why an idea moved forward or didn't. But the score structures judgment — it never replaces it.
ICE: Fast and lightweight. Good for early-stage programs, but nothing stops a zero-traffic idea from scoring high.
PIE: Page-centric and subjective. "Potential" and "importance" overlap enough that scores drift toward whoever's in the room.
RICE: Reach makes measurement feasibility a first-class input. Slightly more work to score, materially better decisions. It sits at the right point on the complexity/usefulness curve.
Measuring What a Program Is Actually Worth
To make experimentation legible to the C-suite, every verified win runs through a financial impact model:
It's a good executive-communication tool — and taken alone, it flatters the program. A credible experimentation leader tells the CFO where the model is wrong before the CFO finds out:
Winner's curse
Tests that reach significance tend to overestimate their true lift — you selected them because they looked good. Observed lifts get shrinkage applied before anyone annualizes them.
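One simple way to apply that shrinkage is a normal-normal (empirical Bayes) adjustment that pulls each observed lift toward the program's historical average in proportion to how noisy the estimate is. This is a sketch of one common approach, not necessarily the method used here; `prior_mean` and `prior_sd` are assumed to come from the program's own history of true effects, and the numbers are hypothetical:

```python
def shrink_lift(observed_lift, standard_error, prior_mean=0.0, prior_sd=0.02):
    """Posterior-mean lift under a normal prior on true effects.

    weight -> 1 when the test is precise (small standard error),
    weight -> 0 when the test is noisy relative to the prior spread.
    """
    if standard_error < 0 or prior_sd <= 0:
        raise ValueError("standard_error must be >= 0 and prior_sd > 0")
    weight = prior_sd**2 / (prior_sd**2 + standard_error**2)
    return prior_mean + weight * (observed_lift - prior_mean)


# Hypothetical readout: +8% observed lift with a 3-point standard error,
# against a history where true lifts cluster around 0 with a 2-point spread.
adjusted = shrink_lift(observed_lift=0.08, standard_error=0.03)
print(f"observed 8.0%, shrunk {adjusted:.1%}")
```

The noisier the test, the harder the pull toward the prior, so a barely significant result loses most of its headline lift before it reaches the financial model.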
Lift decay & novelty effects
A 46-day lift is not a 12-month lift. Novelty fades, competitors respond, audiences saturate. Naive annualization is the most common way programs inflate their own impact.
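The gap between naive and decay-aware annualization is easy to see in a few lines. The sketch below assumes the incremental revenue decays exponentially with a fixed half-life, which is a simplifying assumption rather than a measured curve; the weekly figure and the half-life are hypothetical:

```python
def naive_annualized(weekly_incremental_revenue, weeks=52):
    """Assumes the test-period lift holds at full strength all year."""
    return weekly_incremental_revenue * weeks


def decayed_annualized(weekly_incremental_revenue, half_life_weeks, weeks=52):
    """Sums weekly revenue where the lift halves every `half_life_weeks`."""
    return sum(
        weekly_incremental_revenue * 0.5 ** (week / half_life_weeks)
        for week in range(weeks)
    )


# Hypothetical: $10,000/week of incremental revenue measured during the test,
# with the lift assumed to halve every 26 weeks.
naive = naive_annualized(10_000)
decayed = decayed_annualized(10_000, half_life_weeks=26)
print(f"naive ${naive:,.0f} vs decay-adjusted ${decayed:,.0f}")
```

The half-life itself should be estimated from post-launch data or holdouts where possible. The point of the sketch is only that the annual number is highly sensitive to it.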
Baseline drift & seasonality
Baseline conversion rates move with seasons, pricing, and market conditions. A static baseline in the formula quietly misattributes market movement to the program.
Revenue is not margin
Revenue impact and contribution-margin impact are different numbers, and the CFO cares about the second one. Where margin data exists, the model should use it.
Interaction effects
Ten concurrent winners rarely sum. Overlapping audiences and compounding changes mean the whole is usually less than the sum of the readouts.
Long-Term Holdouts: The Program-Level Audit
The strongest correction for all five failure modes is the same one platforms like Eppo and Statsig have productized: long-term holdouts. Keep a small slice of the audience on the old experience for a quarter or more, ship winners to everyone else, and measure the cumulative gap. That single number captures lift decay, novelty effects, and interaction between concurrent winners — the true, durable value of the program rather than the sum of its optimistic readouts.
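The holdout comparison reduces to a small calculation: revenue per user in the group that received every shipped winner versus the group held back, then that measured gap set against what the individual readouts claimed. A minimal sketch with invented figures; `realization_rate` is a name introduced here for illustration, not a term from any platform:

```python
def holdout_program_lift(holdout_revenue, holdout_users, exposed_revenue, exposed_users):
    """Relative gap in revenue per user between exposed users and the holdout."""
    rpu_holdout = holdout_revenue / holdout_users
    rpu_exposed = exposed_revenue / exposed_users
    return (rpu_exposed - rpu_holdout) / rpu_holdout


def realization_rate(measured_lift, summed_readouts):
    """Share of the claimed per-test lift that shows up in the holdout gap."""
    return measured_lift / summed_readouts


# Hypothetical quarter: 5% of users held back on the old experience.
lift = holdout_program_lift(
    holdout_revenue=500_000, holdout_users=50_000,
    exposed_revenue=10_260_000, exposed_users=950_000,
)
# Hypothetical: the quarter's individual readouts summed to a 14% lift.
print(f"measured {lift:.1%}, realized {realization_rate(lift, 0.14):.0%} of readouts")
```

A real analysis would also put a confidence interval around the gap, since a small holdout makes the estimate noisy.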
Per-test models make the program legible quarter to quarter. Holdouts keep it honest year to year. A mature program runs both.
There Is No One-Size-Fits-All
Everything above is a starting architecture, not a template. The right version of this system depends on team size, statistical maturity, tooling, company politics, the OKRs and KPIs each team is actually measured on, bandwidth, and who owns what. A five-person growth team and a five-brand enterprise need very different amounts of process — and installing more governance than a team can absorb kills velocity just as surely as having none.
This is also the most common failure mode of hiring it out: agencies that install the same templated process at every client, regardless of how the team actually works. The mismatch shows up as friction — between teams, and inside them — and the program gets blamed for what was really a fit problem. The operating system has to serve the team. Never the other way around.
Each experiment feeds the next. Behavioral insights don't expire — they compound into a permanent intelligence advantage over competitors who are still guessing.
Two Ways to Use This System
Building a program?
Audits, 90-day sprints, and advisory retainers — I build the experimentation operating system with your team, shaped to how your team actually works.
Work with me →
For Hiring Teams
Hiring for experimentation or growth?
I lead growth experimentation functions — the operating model, the measurement standards, and the executive decision layer. See the case studies under Work, or reach out directly.
Get in touch →
You Don't Need to Learn PRISM. You Need Someone to Run It.
The framework works. I've run it 500+ times across Fortune 150 and startups. The question is whether you want to build this capability internally — or get results now.
The average company leaves $2M+ in unrealized revenue from unoptimized funnels. One experiment that moves your conversion rate 0.5% could recover $500K. The question isn't cost — it's what it costs you to keep guessing.
Advisory
Starting at $5,000/mo
For teams that have the execution capacity but need the behavioral science expertise and experimentation strategy.
- Monthly strategy sessions
- Experimentation roadmap & prioritization
- Behavioral audit of your funnel
- Hypothesis frameworks for your team
- Async Slack/email support
- Month-to-month after first 90 days
Done-for-You
Starting at $15,000/mo
I embed with your team and run the entire experimentation program. You get the strategist, the executor, and the system — not a junior analyst.
- Everything in Advisory
- Full experiment design, build & QA
- Revenue forecasting per experiment
- Executive reporting & stakeholder alignment
- 10–15 experiments per quarter
- Direct access — no account managers
Half the cost of a top-tier agency. The person with $30M+ in results does the work — not a junior they hired last month.
Ready to Stop Guessing?
Tell me about your growth challenge. No pitch decks, no sales reps — just a direct conversation about whether I can help.
Revenue Frameworks for Growth Leaders
Every week: one experiment, one framework, one insight to make your marketing more evidence-based and your revenue more predictable.
Free · No spam · Unsubscribe anytime
Read the archive
200+ issues of experiments, frameworks, and field reports from inside a Fortune 150 growth team.
Open Substack