Experiment backlog: Prioritization at scale

Mon Jun 23 2025

You know that sinking feeling when you look at your experiment backlog and it's somehow grown to 200+ items overnight? Yeah, me too. It starts innocently enough - a few A/B tests here, some feature flags there - but before you know it, you're drowning in a sea of "high priority" experiments that all needed to run yesterday.

The truth is, most companies hit this wall around the 50-experiment mark. That's when the wheels start coming off: teams step on each other's tests, nobody knows what's actually running, and your data starts looking like abstract art. Let's talk about how to fix this mess before it gets worse.

The challenges of managing experiment backlogs at scale

Here's what nobody tells you about scaling experimentation: the problems compound exponentially, not linearly. When you go from 10 to 100 experiments, you don't get 10x the headaches - you get 100x.

The first pain point hits when multiple teams start running experiments. Suddenly, you've got the marketing team testing pricing while engineering is messing with the checkout flow. Both tests affect conversion rates, but good luck figuring out which one actually moved the needle. The team at Booking.com learned this the hard way - they now run over 1,000 concurrent experiments, but only after building systems to detect and prevent conflicts.

Then there's the data integrity nightmare. With dozens of experiments running simultaneously, your analytics start resembling a game of telephone. One misconfigured event can cascade through multiple tests, turning your carefully crafted experiments into expensive guesswork. Netflix's engineering team documented how they caught data discrepancies affecting 15% of their experiments - and that was with a mature testing infrastructure.

The complexity snowball keeps rolling. Modern experiments rarely touch just one service. You might start with a simple button color test, but it needs to work across web, mobile apps, and that legacy system everyone pretends doesn't exist. Each additional platform doubles your coordination overhead and triples your debugging time.

The scariest part? Most teams don't realize they have these problems until it's too late. By then, you're spending more time managing the mess than actually learning from experiments.

Effective prioritization techniques for experiment backlogs

Let's be honest: those fancy prioritization frameworks everyone talks about? They're only as good as the data you feed them. But when used right, they can turn your chaotic backlog into something actually manageable.

The RICE framework (Reach, Impact, Confidence, Effort) works great for product-focused experiments. Here's the thing though - you need to be ruthless about scoring. If everything is high-impact, nothing is. At Spotify, they modified RICE to include a "strategic alignment" score, which helped them cut their active experiments by 40% while actually increasing their learning velocity.
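
If you want to make the scoring mechanical, here's a minimal sketch of a RICE calculator with an optional alignment multiplier bolted on. The field names, weights, and example numbers are illustrative - not Spotify's (or anyone's) production formula:

```python
from dataclasses import dataclass

@dataclass
class ExperimentIdea:
    name: str
    reach: int         # users affected per quarter
    impact: float      # 0.25 = minimal, 3 = massive
    confidence: float  # 0.0 to 1.0
    effort: float      # person-weeks
    alignment: float = 1.0  # optional strategic-alignment multiplier

def rice_score(idea: ExperimentIdea) -> float:
    # Classic RICE is (reach * impact * confidence) / effort;
    # the alignment multiplier is the optional extra layer.
    return (idea.reach * idea.impact * idea.confidence * idea.alignment) / idea.effort

backlog = [
    ExperimentIdea("New pricing page", reach=50_000, impact=2.0, confidence=0.8, effort=4),
    ExperimentIdea("Button color tweak", reach=200_000, impact=0.25, confidence=0.5, effort=1),
]

for idea in sorted(backlog, key=rice_score, reverse=True):
    print(f"{idea.name}: {rice_score(idea):,.0f}")
```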

For a simpler approach, try this three-bucket system (there's a quick code sketch of it after the list):

  • Ship it: Experiments that directly impact this quarter's OKRs

  • Test it: Interesting ideas that need validation

  • Shelf it: Everything else (yes, even the CEO's pet project)
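
Here's roughly what that triage can look like in code, assuming you tag each idea with the OKRs it touches. The tag names and the "clear hypothesis" flag are made up for illustration:

```python
QUARTER_OKRS = {"activation", "checkout-conversion"}  # whatever this quarter's OKRs are

def bucket(idea: dict) -> str:
    """Sort a backlog idea into Ship it / Test it / Shelf it."""
    if idea["okr_tags"] & QUARTER_OKRS:     # touches a current OKR
        return "Ship it"
    if idea["has_clear_hypothesis"]:        # worth validating, but not urgent
        return "Test it"
    return "Shelf it"                       # park it, explicitly

print(bucket({"okr_tags": {"activation"}, "has_clear_hypothesis": True}))  # Ship it
print(bucket({"okr_tags": {"dark-mode"}, "has_clear_hypothesis": False}))  # Shelf it
```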

The MoSCoW method adds another layer, but here's a pro tip: your "Won't have" list is just as important as your "Must have" list. It's where you explicitly park all those "great ideas" that would derail your focus. Google's growth team credits their success partly to maintaining a public "Not doing" list that's twice as long as their active experiments.

Regular backlog grooming isn't optional - it's survival. Set a weekly 30-minute slot where you:

  1. Kill experiments that have been "about to start" for over a month

  2. Merge similar tests (you'd be amazed how often teams propose variants of the same idea)

  3. Escalate blockers before they become emergencies

Remember: a smaller backlog of high-quality experiments beats a massive list of maybes every time.
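
Step 1 of that grooming checklist is easy to automate. Here's a minimal sketch that flags experiments stuck in "about to start" for more than 30 days; the field names and the 30-day cutoff are just assumptions to make the idea concrete:

```python
from datetime import date, timedelta

# Toy backlog rows; in practice these come from wherever you track
# experiments (a spreadsheet, Jira, your experimentation platform).
backlog = [
    {"name": "Checkout copy test", "status": "about to start", "last_touched": date(2025, 4, 28)},
    {"name": "New onboarding flow", "status": "running", "last_touched": date(2025, 6, 20)},
]

STALE_AFTER = timedelta(days=30)
today = date(2025, 6, 23)

stale = [e for e in backlog
         if e["status"] == "about to start"
         and today - e["last_touched"] > STALE_AFTER]

for experiment in stale:
    print(f"Kill or re-justify: {experiment['name']}")
```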

Cultivating a scalable experimentation culture

Culture eats strategy for breakfast, and nowhere is this truer than in experimentation. You can have all the tools and processes in the world, but if your culture doesn't support testing, you're dead in the water.

The biggest culture killer? Punishing "failed" experiments. I've seen teams celebrate only the wins, which inevitably leads to people gaming the system. They'll run safe tests with predictable outcomes or, worse, cherry-pick metrics until something looks positive. Microsoft learned this lesson and now celebrates "high-quality failures" - experiments that definitively prove a hypothesis wrong.

Here's what actually works for building experimentation culture:

  • Democratize access: If only data scientists can run tests, you'll never scale

  • Share learnings publicly: Airbnb's experiment review emails are legendary for a reason

  • Make it stupid simple: The easier it is to launch a test, the more people will do it

  • Set learning goals, not just business goals: "We'll understand user behavior around X" is perfectly valid

The collaboration piece is crucial but often bungled. Don't create another steering committee or approval board. Instead, build lightweight ways for teams to see what others are testing. Statsig's experiment repository feature, for example, lets teams browse active experiments and flag potential conflicts without adding meetings to everyone's calendar.

One counterintuitive tip: start saying no to experiments. When leadership sees you rejecting low-quality tests, they'll start taking the process seriously. It signals that experimentation is a discipline, not just throwing spaghetti at the wall.

Building technical infrastructure for large-scale experimentation

Let's cut through the vendor pitches and talk about what you actually need to run experiments at scale. Spoiler alert: it's not as complicated as the enterprise software salespeople want you to believe.

Your infrastructure needs to nail three things, in this order:

  1. Reliability: Tests run correctly 99.9% of the time

  2. Speed: Results available within hours, not weeks

  3. Self-service: Teams can launch without engineering help

The automation piece isn't optional once you hit scale. Manual test setup is like hand-copying books in the age of printing presses - technically possible but absurdly inefficient. Focus automation efforts on the areas below (there's a quick sketch of the stats piece after the list):

  • Experiment setup and targeting

  • Data quality checks (catch those tracking errors early)

  • Results calculation and statistical significance

  • Cleanup and ramping decisions
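
The significance piece is more approachable than it sounds. Here's a self-contained sketch of a pooled two-proportion z-test for conversion rates - the kind of check you'd run automatically when an experiment's results land. The example numbers are made up:

```python
from math import erf, sqrt

def two_proportion_p_value(conversions_a: int, users_a: int,
                           conversions_b: int, users_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    rate_a, rate_b = conversions_a / users_a, conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_a - rate_b) / std_err
    # Convert |z| to a two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p = two_proportion_p_value(conversions_a=480, users_a=10_000,
                           conversions_b=530, users_b=10_000)
print(f"p = {p:.3f}")  # flag for human review against whatever threshold you use
```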

For data handling, here's the uncomfortable truth: you'll need 10x the capacity you think you need. Experiments generate massive amounts of data, especially when you start segmenting results. The team at Uber shared that their experimentation platform processes over 1 billion events daily - and they're not even the biggest player out there.

Security often gets treated as an afterthought until something goes wrong. Build these protections in from day one (a tiny killswitch sketch follows the list):

  • Audit logs for every experiment change

  • PII handling for user-level data

  • Rate limiting to prevent runaway tests

  • Automatic killswitches for misbehaving experiments
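
To make the killswitch idea concrete, here's an illustrative guardrail check. It's not any vendor's real API - the metric names and thresholds are assumptions, and you'd wire the decision into whatever your platform exposes for disabling an experiment:

```python
def breaches_guardrails(metrics: dict,
                        error_rate_ceiling: float = 0.02,
                        p95_latency_ceiling_ms: float = 1500) -> bool:
    """Return True if a running experiment should be auto-killed."""
    return (metrics.get("error_rate", 0.0) > error_rate_ceiling
            or metrics.get("p95_latency_ms", 0.0) > p95_latency_ceiling_ms)

# Poll this on a schedule; the live metrics and the "disable experiment"
# call both come from your own platform and are deliberately left out here.
live = {"error_rate": 0.035, "p95_latency_ms": 900}
if breaches_guardrails(live):
    print("Guardrail breach: disable the experiment and page the owner")
```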

The best infrastructure is invisible to end users. If product managers need to understand your data pipeline to run a test, you've already failed. Tools like Statsig abstract away the complexity while maintaining the robustness you need at scale.

Closing thoughts

Managing experiments at scale is hard, but it's not impossible. The key is recognizing that it's not just a technical problem - it's equal parts process, culture, and infrastructure. Start small, be ruthless about prioritization, and invest in the boring stuff (like data quality) before it becomes a crisis.

Remember: every company that's successfully running hundreds of experiments started exactly where you are now. The only difference? They started building for scale before they needed it.

Hope you find this useful!


