You set up the experiment carefully, collected data diligently, and celebrated when the results reached statistical significance on your KPI. You're ready to claim the win in your next performance review, only to find that launching the feature doesn't bring the positive impact to the user experience you were hoping for.
The experiment failed, despite getting significant results.
This happens more often than you'd think: Successful experiments that fail to deliver meaningful real-world impact represent one of the biggest paradoxes in A/B testing.
Let's examine why this happens, and how to turn "failed" experiments into future wins.
One of the most common reasons for such a scenario is solving the wrong problem.
This happens when teams optimize for a specific metric or problem without questioning whether it’s the right problem to solve in the first place. This leads to wasted effort, resources, and sometimes even worse outcomes.
For example, a company notices that customer service calls are taking too long.
They set Average Call Length (ACL) as the KPI, hoping that reducing it will lead to higher efficiency.
Once implemented, agents rush through calls to keep them short, but more customers call back because their issues aren’t fully resolved.
Zooming out, the actual issue the company is trying to solve isn't call length—it’s customer satisfaction and issue resolution. Instead of optimizing ACL, they should focus on measuring things that are better aligned with their real business goals, like the resolution rate for first calls, or customer satisfaction scores.
Even when you're solving the right problem, it's possible that the chosen metrics don't capture the full picture. This happens when teams measure only a leading indicator of the metric they actually care about, or focus on short-term gains while missing long-term regressions.
For example, a mobile app wants to boost daily active users (DAU), so they implement push notifications that guilt-trip users into coming back.
When they run a two-week experiment to measure the impact, they find that DAU increases as more users return to the app. But as you might imagine, this ‘DAU driver’ eventually leads to long-term failure, as users begin to feel manipulated and frustrated—leading to negative reviews, uninstalls and churn. Ultimately, brand trust erodes.
A better approach in this scenario would be to focus on meaningful engagement metrics like session length, retention rate, and customer satisfaction, in addition to just DAU.
Another important concept here is False Positive Risk (FPR), which is the probability that a statistically significant result is actually a false positive.
In other words, even if you set up the test perfectly, measure the correct metric, and achieve a statistically significant result, there is still a chance that the result will not lead to business impact in the real world.
According to this paper on common experimentation misunderstandings, the false positive risk can be higher than people intuitively expect, based on historical success rates at different companies.
| Company/Source | Success Rate | FPR | Reference |
|---|---|---|---|
| Microsoft | 33% | 5.9% | (Kohavi, Crook and Longbotham 2009) |
| Avinash Kaushik | 20% | 11.1% | (Kaushik 2006) |
| Bing | 15% | 15.0% | (Kohavi, Deng, Longbotham, et al. 2014) |
| Booking.com, Google Ads, Netflix | 10% | 22.0% | (Manzi 2012; Thomke, Experimentation Works: The Surprising Power of Business Experiments, 2020; Moran 2007) |
| Airbnb Search | 8% | 26.4% | |
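For intuition on where these numbers come from, FPR can be estimated with Bayes' rule from the prior success rate, the significance level, and statistical power. The sketch below is a back-of-the-envelope calculation under assumptions of my own (a two-sided α of 0.05, i.e., 0.025 in the favorable direction, and 80% power); with those assumptions it reproduces the FPR column above.

```python
# Back-of-the-envelope FPR from the prior success rate. The alpha and power
# values are assumptions for illustration, not taken from the paper itself.
def false_positive_risk(success_rate, alpha_one_sided=0.025, power=0.8):
    """P(no real effect | statistically significant win), via Bayes' rule."""
    true_wins = power * success_rate                   # real effects we detect
    false_wins = alpha_one_sided * (1 - success_rate)  # nulls that look significant
    return false_wins / (true_wins + false_wins)

# Historical success rates from the table (Microsoft's 33% taken as 1/3).
for rate in [1 / 3, 0.20, 0.15, 0.10, 0.08]:
    print(f"success rate {rate:.0%} -> FPR {false_positive_risk(rate):.1%}")
```

The takeaway: the lower your organization's historical win rate, the more skeptical you should be of any single significant result.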
The examples we talked about in the last section have another name: KPI traps.
KPI traps happen when teams focus too much on hitting KPIs without considering their unintended consequences or overall business impact. This can lead to misaligned incentives, short-sighted decision-making, or even harm to the company in the long run.
Here are some practical ways to avoid KPI traps.
Think before execution. Take time to carefully evaluate your goals and strategy before diving into running any A/B test. Think through what problems you're really trying to solve and whether your chosen metrics truly align with those objectives.
Focus on today’s objectives, not historical ones. While historical data can provide context, stay focused on your current business objectives and user needs rather than being constrained by past approaches.
Remember that Netflix started in 1997 as a DVD rental service; Twitter began as a platform for finding and subscribing to podcasts; Play-Doh was originally intended to be used as a wallpaper cleaner. Imagine if these companies had stuck with the metrics they originally focused on.
Not everything is measurable. Even the most data-driven people have to admit that human behaviors are infinitely subtle and complex in ways that metrics could never adequately capture.
When Apple released the Magic Mouse 2, it came with a 'design flaw' that would have tanked every imaginable metric: in order to charge it, the mouse had to be turned upside down, rendering it temporarily unusable. But because of this "design flaw," people didn't use it plugged in like a regular mouse, and thus it truly felt magical.
Incorporate leading indicators to paint the whole picture. Besides lagging top-line metrics (e.g., revenue and profit), experimenters should also measure leading indicators that describe the user experience (e.g., user retention, page load time, and satisfaction scores). Neglecting these indicators can lead to significant business failures, and once teams realize something is wrong, it's often too late.
Myspace's failure is largely attributed to heavy monetization that resulted in a cluttered and slow user interface (and although they may have noticed that tradeoff, Myspace couldn't experiment without forfeiting revenue due to its advertising deal).
Use guardrails. Guardrail metrics monitor critical aspects of the product that could be negatively affected by the changes being tested. Tracking these metrics is important in order to catch unintended consequences, like performance regressions or drops in user retention.
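In practice, a guardrail check can be as simple as flagging a regression whenever the confidence interval for the change in a guardrail metric sits entirely below an acceptable threshold. Here's a minimal sketch (the metric, threshold, and function name are hypothetical illustrations, not any particular platform's API):

```python
import math

def guardrail_regression_detected(
    control_successes: int, control_n: int,
    treatment_successes: int, treatment_n: int,
    max_tolerable_drop: float = 0.005,  # tolerate at most a 0.5pp drop (assumed)
    z: float = 1.96,                    # 95% confidence interval
) -> bool:
    """Return True if the guardrail metric (e.g., 7-day retention) clearly regressed."""
    p_c = control_successes / control_n
    p_t = treatment_successes / treatment_n
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    upper = diff + z * se
    # Even the optimistic end of the interval is worse than the tolerable drop:
    # the guardrail fires and the experiment should not ship as-is.
    return upper < -max_tolerable_drop

# Example: retention drops from 40.0% to 38.8% with 100k users per group.
print(guardrail_regression_detected(40_000, 100_000, 38_800, 100_000))  # True
```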
The more, the merrier? Not necessarily. While having multiple metrics is important, tracking too many can lead to analysis paralysis, diluted focus, and difficulty making clear decisions. The multiple comparisons problem stands out as one of the most significant downsides: When analyzing a lot of metrics simultaneously, the likelihood of finding at least one false positive result increases substantially. Decisions based on false positives can lead to misguided actions, causing a successful experiment to fail after launch.
Statistical methods like the Bonferroni correction or the Benjamini-Hochberg procedure should be used to adjust significance thresholds when multiple metrics are tested (be aware that the Bonferroni correction can be quite conservative and may reduce the ability to detect true effects).
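To make the difference concrete, here's a minimal hand-rolled sketch of both corrections (the p-values are hypothetical; in practice you'd likely reach for a statistics library rather than rolling your own):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value compared against alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H0 for all p-values up to the largest rank i with p_(i) <= (i/m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[i] = True
    return reject

# Hypothetical p-values from five metrics in one experiment.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bonferroni(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

Bonferroni controls the chance of even one false positive and so rejects less; Benjamini-Hochberg controls the expected share of false positives among your "wins," which is usually a better fit for dashboards with many secondary metrics.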
Run holdout experiments. One of the most effective ways to validate the long-term impact of shipped experiments is a holdout: a small portion of users who are kept on the old experience even after the feature is officially launched.
This holdout group serves as a baseline for comparison against the new experience, allowing teams to track whether the experiment's benefits persist, fade, or even turn negative over time. Large companies like Disney and Uber have been using holdouts to uncover insights that might otherwise remain hidden.
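As a rough sketch of how a holdout can be carved out, the snippet below deterministically hashes users into buckets and keeps a small slice on the old experience (the 5% size, salt, and function names are my own illustration, not any specific platform's implementation):

```python
import hashlib

HOLDOUT_PERCENT = 5  # hypothetical: keep 5% of users on the old experience

def bucket(user_id: str, salt: str = "2024-global-holdout") -> int:
    """Deterministically map a user to one of 100 buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_holdout(user_id: str) -> bool:
    """Users in the first HOLDOUT_PERCENT buckets never see launched features."""
    return bucket(user_id) < HOLDOUT_PERCENT

def get_experience(user_id: str) -> str:
    return "control (holdout)" if in_holdout(user_id) else "launched feature"

print(get_experience("user_12345"))
```

Because assignment is a pure function of the user ID and a fixed salt, the same users stay in the holdout across sessions and launches, which is what makes long-term comparisons meaningful.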
Watch out for winner’s curse. Winner’s curse refers to a statistical phenomenon where the "winning" group of an experiment tends to have an overestimated effect size. False positive risk, as we mentioned above, can lead to the winner's curse by suggesting a non-existent effect and/or by exaggerating its magnitude. Statsig uses the method described in this paper to adjust for false positive risk when aggregating impacts from multiple experiments.
This adjustment won't catch KPI traps proactively, but it can provide a more realistic estimate of impact and help account for potential shortfalls after a feature launches.
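To see the winner's curse in action, here's a small simulation (my own illustration, not the adjustment method from the paper): every simulated experiment has the same modest true lift, yet the average measured lift among the "significant winners" comes out much larger.

```python
import random
import statistics

random.seed(42)

TRUE_LIFT = 0.5   # true effect, in arbitrary units (assumed for illustration)
NOISE_SD = 1.0    # standard error of each experiment's estimate
Z_CRIT = 1.96     # significance cutoff for the measured lift

# Each experiment reports true lift plus sampling noise.
measured = [random.gauss(TRUE_LIFT, NOISE_SD) for _ in range(100_000)]
winners = [m for m in measured if m / NOISE_SD > Z_CRIT]  # "significant" wins only

print(f"true lift:                 {TRUE_LIFT:.2f}")
print(f"avg lift across all tests: {statistics.mean(measured):.2f}")  # ~0.50
print(f"avg lift among 'winners':  {statistics.mean(winners):.2f}")   # well above 0.50
```

Conditioning on statistical significance selects the experiments whose noise happened to point upward, which is exactly why shipped "wins" tend to under-deliver.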
That being said, one of the best things I learned in my career is that failures can be turned into lessons. As the ancient Chinese proverb says: “Failure is the mother of success”.
This is why we suggest teams be careful with KPI traps but still encourage 'good' failures, because those failures push teams to innovate and explore new territory. It's also why a scalable experimentation system helps manage the risk, letting product builders iterate quickly and avoid major setbacks.
Ultimately, experimentation is a journey full of ups and downs, successes and failures. It’s important to set up your experiments properly and be careful with KPI traps.
Just remember that failed experiments aren't the end—they're stepping stones to innovation. By embracing both the wins and the setbacks, we can turn failures into learnings.