If your experiment backlog is full but your learning feels thin, it’s usually not a testing problem. It’s a memory problem. Teams run dozens of tests, then six months later no one can find what happened, why it happened, or whether it’s safe to try again.

A solid A/B test repository fixes that, but only if people can retrieve past work fast. Search that “kind of works” still leads to duplicate experiments, repeated debates, and a steady drip of lost context.

This article breaks down how to design an experiment library (and its filters) around the way experimentation leaders actually hunt for answers: by audience, device, funnel stage, risk, and impact.

Why experiment repository search fails in real teams

[Image: three modern SaaS-style diagrams depicting the shift from scattered A/B testing tools to a centralized repository, practical experiment filters, and a compounding learnings flywheel for CRO institutional memory.]

Most experiment “search” fails for a simple reason: it depends on remembering the exact words someone used months ago. One PM types “checkout CTA,” another writes “place order button,” and a third titles the doc “Step 3 friction.” Keyword search can’t bridge that gap without structure.

So teams fall back on workarounds:

  • Asking in Slack and hoping the right person sees it.
  • Rebuilding context from old Jira tickets and scattered screenshots.
  • Re-running a test because it’s faster than finding the old one.

This is why Jira, Confluence, Notion, and Excel often feel fine early on, then become inadequate once the program scales. They’re good transitional storage, but they don’t behave like an experimentation hub. They lack consistent fields, enforced tagging, and reliable reporting on what the org has already learned.

A real A/B test repository functions like an experiment knowledge base. It stores past experiments with structured metadata, so retrieval doesn’t depend on tribal knowledge. It also supports an experimentation center of excellence, because you can audit quality, spot patterns, and reuse learnings across teams instead of re-litigating every hypothesis.

If you want a reference point for what “centralized, searchable” looks like, start with a testing command center style library such as https://lab.growthlayer.app/library.

The filters people actually use (and how to make them stick)

Good filters match the questions teams ask under time pressure. Not “What was the experiment name?” but “Have we tried this for mobile new users at checkout, and was it risky?”

Below are five filters that do real work, plus the design rules that keep them usable.

Audience: who the change was meant for

Audience is the fastest way to find relevant learnings across product areas. Keep the list opinionated and short. Start with buckets teams already use: new vs returning, high-intent vs low-intent, logged-in vs logged-out, geo, plan tier.

Don’t make “audience” a free-text field. Use a controlled list, and add a short free-text note only when needed.
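
As a minimal sketch, here’s what a controlled audience list could look like in TypeScript. The bucket names are illustrative, not a prescription:

```typescript
// Controlled audience buckets: a closed union instead of free text.
// Bucket names are illustrative; keep the list short and opinionated.
type AudienceBucket =
  | "new"
  | "returning"
  | "high-intent"
  | "low-intent"
  | "logged-in"
  | "logged-out";

interface AudienceTag {
  bucket: AudienceBucket; // must come from the controlled list
  note?: string;          // short free-text note, only when needed
}
```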

Device: because mobile outcomes aren’t portable

Device is a must-have filter, not a nice-to-have. Many “wins” are just mobile fixes, and many “losses” are desktop-only assumptions. At minimum: Mobile, Desktop, and Responsive (or All).

If your stack supports it, capture OS or browser only when it explains the result (example: an iOS payment sheet).

Funnel stage: the best guardrail against duplicate tests

Funnel stage makes retrieval feel obvious. When someone says “This is a checkout problem,” they should be able to filter to Checkout and see everything that touched it.

Keep stage names simple and consistent. A practical starter set:

  • Acquisition
  • Activation
  • Checkout
  • Retention (optional, if you run lifecycle tests)

Risk: so teams can judge what’s safe to repeat

Risk should reflect blast radius, not just effort. A pricing test that takes little engineering can still be high-risk. Use three levels (Low, Medium, High) with a one-line definition each.

Risk becomes valuable when it’s paired with notes on reversibility (can we roll back instantly?) and compliance (does it touch payments, claims, regulated content?).

Impact: the filter that prioritizes what to copy next

Impact shouldn’t be “How big was the lift?” because early in planning you don’t know that. Define impact as the potential business upside if it works (Low, Medium, High), based on traffic and funnel sensitivity.

A quick way to keep impact consistent is to tie it to the metric and surface area: top-of-funnel pages tend to be higher reach, while niche settings screens tend to be lower reach.

Here’s a compact schema that teams can fill out without hating you. The sketch below is a minimal TypeScript version; the field names are illustrative, so adapt the controlled lists to your own program:
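
```typescript
// Illustrative field names; adjust the controlled lists to your program.
type Device = "mobile" | "desktop" | "responsive";
type FunnelStage = "acquisition" | "activation" | "checkout" | "retention";
type Level = "low" | "medium" | "high"; // reused for risk and impact
type Outcome = "win" | "loss" | "inconclusive";

interface ExperimentRecord {
  id: string;
  hypothesis: string;       // one sentence, with the "because" included
  primaryMetric: string;
  guardrails: string[];
  audience: string;         // from your controlled audience list
  device: Device;
  funnelStage: FunnelStage;
  risk: Level;              // blast radius, not effort
  reversible: boolean;      // can we roll back instantly?
  impact: Level;            // potential upside, set at planning time
  outcome?: Outcome;        // filled in once the test concludes
  whyItHappened?: string[]; // 2 to 4 bullets, not a novel
}
```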

Documentation standards that prevent re-runs and unlock reuse

Filters only work if the underlying documentation is consistent. The goal isn’t more writing. It’s the right facts, captured the same way every time, so storing and retrieving past experiments becomes routine.

A practical documentation minimum for every test in your experiment library (a filled-in example follows the list):

  1. Hypothesis (one sentence, with the “because” included)
  2. Primary metric and guardrails
  3. Variants (what changed, and where)
  4. Audience and exclusions (who saw it, who didn’t)
  5. Device and funnel stage (from controlled lists)
  6. Risk and impact (from controlled lists, set at planning time)
  7. Result (Win, Loss, Inconclusive) plus effect size and direction
  8. Why we think it happened (2 to 4 bullets, not a novel)
  9. Follow-ups (ship, iterate, or park, with owners)
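
To make the standard concrete, here’s a hypothetical filled-in record. Every value below is invented for illustration, echoing the failure scenario described in the next section:

```typescript
// A hypothetical filled-in record; every value is invented for illustration.
const checkoutCtaTest = {
  hypothesis:
    "Changing 'Buy now' to 'Place order' will lift mobile checkout completion, " +
    "because 'Buy now' reads as an instant charge and scares hesitant users.",
  primaryMetric: "checkout completion rate",
  guardrails: ["refund rate", "support tickets per order"],
  variants: ["control: 'Buy now'", "variant: 'Place order'"],
  audience: "new",          // exclusions: employees, internal test accounts
  device: "mobile",
  funnelStage: "checkout",
  risk: "high",             // touches payments; rollback is instant
  impact: "high",           // set at planning time, not after results
  outcome: "loss",
  whyItHappened: [
    "Ran during a payment provider rollout",
    "Returning users reacted differently than new users",
  ],
  followUps: ["park; revisit with payment reassurance copy, owner: growth PM"],
};
```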

A failure scenario that happens more than teams admit

A growth team tests a “Buy now” button on checkout. It loses. Six months later, a different squad changes the same button again, because they can’t find the old test and the Jira ticket only says “CTA update.” The new test also loses, but now the team has burned engineering time, eroded stakeholder trust, and introduced noisy metrics because the checkout flow changed in other ways.

A centralized A/B test repository prevents this in a boring, reliable way:

  • The second squad filters Funnel stage = Checkout, Device = Mobile, Impact = High.
  • They immediately see the prior test tagged Outcome = Loss, with notes that it ran during a payment provider rollout and that returning users reacted differently than new users.
  • Instead of repeating the same idea, they design a safer follow-up: segmenting by new users, adjusting payment reassurance copy, and scoping the blast radius.

That’s the real payoff. You don’t just prevent duplicate experiments. You reuse learnings across teams, with enough context to form better hypotheses.
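
Here’s a sketch of that retrieval step, assuming records shaped like the schema above; `findPriorTests` is a hypothetical helper, not a real API:

```typescript
// Hypothetical helper: filter an in-memory array of experiment records.
interface Filters {
  funnelStage?: string;
  device?: string;
  impact?: string;
}

function findPriorTests<T extends Record<string, unknown>>(
  repo: T[],
  filters: Filters,
): T[] {
  return repo.filter((record) =>
    Object.entries(filters).every(([key, value]) => record[key] === value),
  );
}

// The second squad's query from the scenario above:
// findPriorTests(repo, { funnelStage: "checkout", device: "mobile", impact: "high" });
```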

Where an AI experimentation system helps (and where it doesn’t)

An AI experimentation system can auto-suggest tags, detect near-duplicate hypotheses, and recommend similar past tests when someone starts a new one. That reduces the “I forgot to tag it” problem.

But AI can’t rescue missing inputs. If your repository doesn’t store audience, device, stage, risk, and impact as structured fields, you’ll get fuzzy retrieval and false matches. Treat AI as an assistant, not a substitute for disciplined documentation.
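
One way to see why structure matters: a naive near-duplicate check can be sketched without any AI at all, using Jaccard overlap of hypothesis tokens. The helpers and the 0.5 threshold below are illustrative; real systems would use embeddings, but they still need the structured fields to anchor the match:

```typescript
// Naive near-duplicate check: Jaccard similarity over lowercase word sets.
// A real system would use embeddings; this only shows the shape of the idea.
function jaccard(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().match(/[a-z]+/g) ?? []);
  const setA = tokens(a);
  const setB = tokens(b);
  const shared = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : shared / union;
}

// Flag existing hypotheses above an arbitrary similarity threshold.
function looksLikeDuplicate(newHypothesis: string, existing: string[]): string[] {
  return existing.filter((h) => jaccard(newHypothesis, h) > 0.5);
}
```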

Conclusion

A good A/B test repository isn’t defined by how many experiments it stores. It’s defined by how fast a new team member can find the last three relevant tests and understand what happened. Filters based on audience, device, funnel stage, risk, and impact turn your experiment library into working institutional memory, not a dusty archive.

Build the filter set people already think in, enforce a short documentation standard, and you’ll spend less time re-running old ideas and more time compounding what you’ve learned.

Atticus Li

Experimentation and growth leader. Builds AI-powered tools, runs conversion programs, and writes about economics, behavioral science, and shipping faster.