Designing a Scalable Experiment Tracking System: The Learning Compound Rate

TL;DR: A scalable experiment tracking system isn't measured by how many tests you can run — it's measured by how many past tests a team can still apply. Most programs leak 60-80% of their institutional learning within 18 months. Here's the framework that stops the leak.

Key Takeaways

  • Scalability in experimentation isn't about test volume — it's about how efficiently past learnings inform new decisions
  • The Learning Compound Rate (LCR) measures the percentage of past experiments a team can recall and apply when designing new tests
  • Most programs have an LCR below 30% within 18 months, meaning more than two-thirds of past learning is effectively lost
  • Knowledge is a depreciating asset: without active preservation through structured archives, institutional memory decays at a predictable rate
  • A scalable system has three layers — capture, retrieval, and meta-analysis — and most orgs invest in the first while neglecting the other two

Test Volume Is the Wrong Scaling Metric

The usual framing for experimentation scalability is throughput: how many tests per quarter can the team run? This framing misses the point.

A team running 50 tests per quarter where nobody remembers the results of last quarter's tests isn't scaling. It's running on a treadmill. The throughput is real but the learning isn't compounding. And the entire premise of an experimentation program is that learning should compound — each test's insight should make the next test's hypothesis sharper.

The real scaling metric is the Learning Compound Rate: the percentage of past experiments that still actively inform decisions today. When this number is low, you're running experiments for nothing. The insights are produced and then lost.

This connects to a well-studied principle in organizational economics: knowledge is a depreciating asset. Gary Becker's work on human capital showed that skills decay without active use. The same applies to experimental knowledge — insights decay without active retrieval into new decisions.

"If you're rerunning tests you forgot you ran, you don't have an archive. You have wasted resources." — Atticus Li

What Gets Lost

Institutional memory in experimentation programs decays in predictable categories:

Hypothesis archives. What did we test last quarter? Unless you can produce the full list in under two minutes, this is already lost.

Result context. Even when results are recorded, the context — why we expected this, what was surprising, what we concluded — is usually documented in chat threads or ephemeral documents that age out.

Failed tests. The most valuable archive category gets the worst treatment. Tests that didn't win often contain the highest learning density (they tell you where your mental model is wrong), but they're the first category teams stop documenting.

Decision traces. Why did we ship variant B over variant A when both were close? That reasoning is almost never captured, and it's the exact information you need the next time a similar close call comes up.

Without active preservation, each category decays on a different timeline. Hypotheses and results can last 6-12 months in accessible form. Context and decision traces often decay within weeks.

The Learning Compound Rate

Here's the formula:

LCR = (Past experiments applied to new decisions in the last quarter) / (Total experiments completed in the past 2 years)

Applied means: a team member looked up, referenced, or built on a past experiment when designing a new test, writing a hypothesis, or making a product decision.
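A quick worked example, with illustrative counts rather than figures from any real program:

```python
# Illustrative LCR calculation; the counts below are made up for the example.
experiments_completed_past_2_years = 120  # all experiments finished in the last 2 years
experiments_applied_last_quarter = 22     # past experiments looked up, cited, or built on last quarter

lcr = experiments_applied_last_quarter / experiments_completed_past_2_years
print(f"LCR = {lcr:.0%}")  # LCR = 18%, in the band where most programs land
```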

Interpretation thresholds:

  • LCR above 60% — Strong institutional memory. Your archive is genuinely part of the decision process.
  • LCR between 30% and 60% — Typical for well-run programs with some discipline around archives.
  • LCR between 10% and 30% — Most programs land here. Archives exist but aren't woven into how decisions get made.
  • LCR below 10% — Your experimentation program is producing learning that immediately evaporates. Past tests might as well not have run.

The threshold matters because LCR compounds with time. A program operating at 60% LCR for three years has a knowledge base that's substantively different from one operating at 15% LCR. The former team makes sharper hypotheses and catches traps earlier. The latter team keeps rediscovering the same patterns.

How to Measure Your LCR

A simple audit:

Step 1 — Pull the 10 most recent experiment proposals.

Step 2 — For each, ask: did the author reference a past experiment in the hypothesis, design, or expected result? Referenced means: cited a specific prior test, pulled a metric baseline from an archive, or built on a past finding.

Step 3 — Count the proposals that referenced at least one past experiment. Divide by 10.

Most teams doing this audit find their LCR is 2-4 out of 10 — somewhere in the 20-40% range. They also find that the 2-4 referenced tests are usually recent (last quarter), and anything older than six months might as well not exist.
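A minimal sketch of the same audit as code, assuming each proposal is a small record listing the past test IDs it cites (the Proposal structure and IDs here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """One new experiment proposal and any past tests it cites."""
    title: str
    referenced_test_ids: list[str] = field(default_factory=list)

def audit_lcr(proposals: list[Proposal]) -> float:
    """Share of proposals that reference at least one past experiment."""
    if not proposals:
        return 0.0
    referencing = sum(1 for p in proposals if p.referenced_test_ids)
    return referencing / len(proposals)

# The 10 most recent proposals (illustrative): 2 reference past tests, 8 do not.
recent = [
    Proposal("Checkout trust badges", ["EXP-041"]),
    Proposal("Pricing page reorder"),
    Proposal("Onboarding email cadence", ["EXP-012", "EXP-033"]),
] + [Proposal(f"Untracked idea {i}") for i in range(7)]

print(f"LCR over the last 10 proposals: {audit_lcr(recent):.0%}")  # 20%
```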

The Three Layers of a Scalable System

A scalable tracking system has three layers. Most orgs invest heavily in the first, partially in the second, and almost nothing in the third.

Layer 1 — Capture. Standardized experiment entry: hypothesis, primary metric, variants, audience, dates, expected impact. This is table stakes. If you can't capture consistently, nothing else works. Template-based intake reduces capture overhead and enforces consistency.

Layer 2 — Retrieval. The ability to find past tests when you need them. This is where most programs fail. Capture without retrieval is like writing in invisible ink. Retrieval requires: tagging by feature area, funnel stage, and hypothesis type; full-text search over hypothesis and result text; filters by outcome (win / loss / inconclusive) and by date range.

Layer 3 — Meta-analysis. Looking across many past tests to find patterns. Which hypothesis types have the highest win rate? Which funnel stages show diminishing returns? Where are we consistently wrong? Meta-analysis produces second-order insights that no individual test can produce, and it requires that layers 1 and 2 are strong enough to support it.

Investing in layer 3 without layers 1 and 2 produces nothing. Investing in layer 1 without layers 2 and 3 produces a graveyard.

Building the System

For capture: A single intake template used for every experiment. Required fields: hypothesis, primary metric, guardrails, audience definition, expected sample size, expected directional lift. Optional fields: behavioral science principle being leveraged, cross-team stakeholders, related past tests.
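One way to encode that template is as a typed record; the field names below are a sketch to adapt to your own tooling, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentIntake:
    # Required fields
    hypothesis: str
    primary_metric: str
    guardrail_metrics: list[str]
    audience: str                     # audience definition, e.g. "new users, US, mobile web"
    expected_sample_size: int
    expected_lift_direction: str      # expected directional lift on the primary metric
    # Optional fields
    behavioral_principle: str | None = None              # e.g. "loss aversion"
    stakeholders: list[str] = field(default_factory=list)
    related_past_tests: list[str] = field(default_factory=list)  # archive IDs this builds on
```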

For retrieval: Tag normalization (a controlled vocabulary of tags, not free-text) and search across hypothesis and insight text. The goal is that any team member can find relevant past tests in under two minutes from a cold start.
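A rough sketch of both pieces, assuming archived tests are plain records with hypothesis, insight, tags, and outcome fields (the vocabulary and field names are illustrative):

```python
# Controlled vocabulary: free-text variants collapse onto one canonical tag.
TAG_VOCABULARY = {
    "pricing": "pricing",
    "pricing experiment": "pricing",
    "pricing tests": "pricing",
    "checkout": "checkout",
}

def normalize_tag(raw: str) -> str:
    """Map a free-text tag onto the controlled vocabulary, or force a human decision."""
    key = raw.strip().lower()
    if key not in TAG_VOCABULARY:
        raise ValueError(f"Unknown tag {raw!r}: add it to the vocabulary or reuse an existing one")
    return TAG_VOCABULARY[key]

def search_archive(archive: list[dict], text: str = "", tags: set[str] | None = None,
                   outcome: str | None = None) -> list[dict]:
    """Filter archived tests by full-text match, normalized tags, and outcome."""
    wanted = set(tags or [])
    hits = []
    for test in archive:
        haystack = (test["hypothesis"] + " " + test["insight"]).lower()
        if text and text.lower() not in haystack:
            continue
        if wanted and not wanted <= set(test["tags"]):
            continue
        if outcome and test["outcome"] != outcome:
            continue
        hits.append(test)
    return hits
```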

For meta-analysis: A quarterly review that asks: what patterns exist in our archive? What's our win rate by funnel stage? By hypothesis type? What have we stopped testing because we always lose there? This produces the second-order insights that sharpen the next quarter's roadmap.
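A minimal version of that quarterly roll-up, assuming each archived test records its funnel stage, hypothesis type, and outcome (field names and records are illustrative):

```python
from collections import defaultdict

def win_rate_by(archive: list[dict], dimension: str) -> dict[str, float]:
    """Win rate grouped by any archived field, e.g. 'funnel_stage' or 'hypothesis_type'."""
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for test in archive:
        key = test[dimension]
        totals[key] += 1
        if test["outcome"] == "win":
            wins[key] += 1
    return {key: wins[key] / totals[key] for key in totals}

# Quarterly review over an in-memory archive (illustrative records).
archive = [
    {"funnel_stage": "checkout", "hypothesis_type": "qualitative",  "outcome": "win"},
    {"funnel_stage": "checkout", "hypothesis_type": "quantitative", "outcome": "loss"},
    {"funnel_stage": "landing",  "hypothesis_type": "quantitative", "outcome": "inconclusive"},
]
print(win_rate_by(archive, "funnel_stage"))     # {'checkout': 0.5, 'landing': 0.0}
print(win_rate_by(archive, "hypothesis_type"))  # {'qualitative': 1.0, 'quantitative': 0.0}
```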

Common Mistakes in Scalable Tracking

Over-standardization in early stages. Requiring 20 fields on every experiment intake creates friction that kills adoption. Start with 5-7 required fields and expand only when teams are completing them reliably.

Free-text tags. "Pricing", "pricing experiment", "Pricing tests", "$" — all the same concept, none searchable together. A controlled vocabulary solves this.

Failed tests unarchived. The cultural discipline of documenting failures as thoroughly as wins takes deliberate effort. Without it, your archive is systematically biased toward wins and teaches you nothing about where your hypotheses are wrong.

No retrieval loop in design. If the design phase of a new test doesn't include "what have we learned that applies here?", the archive isn't being used. This step has to be explicit in the process.

Advanced: Meta-Analysis Patterns Worth Tracking

Once your LCR is above 40%, meta-analysis starts producing patterns that change roadmap decisions:

Win rate by funnel stage. Most programs have a funnel stage where win rates systematically drop. This is useful information for prioritization.

Diminishing returns by feature area. A feature area where you've run 15 tests and had 12 losses tells you something. Stop testing it and redirect.

Hypothesis type performance. Tests based on qualitative research (user interviews, session recordings) often have different win rates than tests based on quantitative patterns (funnel drops, heatmaps). Knowing your ratio is valuable.

Seasonality effects. Some hypotheses win only in specific seasons or contexts. Meta-analysis surfaces these patterns in a way no individual test can.

Frequently Asked Questions

What's the minimum viable tracking system?

A single spreadsheet or wiki page with: hypothesis, primary metric, dates, variant descriptions, result, one-sentence insight. This is the floor. Below this, you have no institutional memory at all.
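If the floor is literally a spreadsheet, the columns might look like this (names are a suggestion, not a standard):

```python
# One row per test; the minimum viable archive.
MINIMUM_COLUMNS = [
    "hypothesis",
    "primary_metric",
    "start_date",
    "end_date",
    "variant_descriptions",
    "result",
    "insight_one_sentence",
]
```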

How do I get team adoption on structured tracking?

The strongest driver is when the archive visibly informs new decisions. When a team member pulls up a past test in a design meeting and it changes the design, adoption follows. Mandates don't work; visible utility does.

Should every test go in the archive?

Yes. Including inconclusive tests and cancelled tests. The archive's value comes from completeness — selective archiving biases the record and undermines meta-analysis.

How often should we do meta-analysis?

Quarterly is the right cadence for most teams. Monthly creates churn without enough data to see patterns. Annually loses the temporal signal entirely.

What's the single highest-leverage investment to improve LCR?

Making past tests searchable from within the new test intake flow. When designing a test, the team should see relevant past tests automatically surfaced. This shifts retrieval from "effortful lookup" to "default awareness" and changes LCR structurally.
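A rough sketch of that surfacing step, assuming proposals and archived tests share the controlled-vocabulary tags from earlier; ranking by tag overlap is one simple choice, not the only one:

```python
def surface_related_tests(archive: list[dict], proposal_tags: set[str], limit: int = 5) -> list[dict]:
    """Rank archived tests by tag overlap with the new proposal and return the top few."""
    scored = [(len(proposal_tags & set(test["tags"])), test) for test in archive]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [test for score, test in scored if score > 0][:limit]

# Wired into the intake form: shown to the author before they write the hypothesis.
# suggestions = surface_related_tests(archive, proposal_tags={"pricing", "checkout"})
```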

Methodology note: LCR thresholds and tracking system patterns reflect experience across mid-market experimentation programs. Specific figures are presented as ranges. Knowledge depreciation framing draws on human capital theory from Becker and others.

---

See what a well-structured experiment archive looks like in practice. Browse the GrowthLayer test library — real experiments organized by funnel stage, behavioral pattern, and outcome.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.