You set up the experiment carefully, collected data diligently, and celebrated when the results reached statistical significance on your KPI. You're ready to claim the win in your next performance review, only to find that launching the feature doesn't bring the positive impact to the user experience you were hoping for.
The experiment failed, despite getting significant results.
This happens more often than you'd think: Successful experiments that fail to deliver meaningful real-world impact represent one of the biggest paradoxes in A/B testing.
Let's examine why this happens, and how to turn "failed" experiments into future wins.
One of the most common reasons for such a scenario is solving the wrong problem.
This happens when teams optimize for a specific metric or problem without questioning whether it’s the right problem to solve in the first place. This leads to wasted effort, resources, and sometimes even worse outcomes.
For example, a company notices that customer service calls are taking too long.
They set Average Call Length (ACL) as the KPI, hoping that reducing it will lead to higher efficiency.
Once implemented, agents rush through calls to keep them short, but more customers call back because their issues aren’t fully resolved.
Zooming out, the actual issue the company is trying to solve isn't call length—it’s customer satisfaction and issue resolution. Instead of optimizing ACL, they should focus on measuring things that are better aligned with their real business goals, like the resolution rate for first calls, or customer satisfaction scores.
Even when you're solving the right problem, it's possible that the chosen metrics don't capture the full picture. This happens when teams measure only a leading indicator of the metric they actually care about, or focus on short-term gains while missing long-term regressions.
For example, a mobile app wants to boost daily active users (DAU), so they implement push notifications that guilt-trip users into coming back.
When they run a two-week experiment to measure the impact, they find that DAU increases as more users return to the app. But as you might imagine, this ‘DAU driver’ eventually leads to long-term failure, as users begin to feel manipulated and frustrated—leading to negative reviews, uninstalls and churn. Ultimately, brand trust erodes.
A better approach in this scenario would be to focus on meaningful engagement metrics like session length, retention rate, and customer satisfaction, in addition to just DAU.
Another important concept here is False Positive Risk (FPR), which is the probability that a statistically significant result is actually a false positive.
In other words, even if you set up the test perfectly, measure the correct metric, and achieve a statistically significant result, there is still a chance that the result will not lead to business impact in the real world.
According to this paper on common experimentation misunderstandings, the false positive risk can be higher than people intuitively expect, based on historical success rates at different companies.
| Company/Source | Success Rate | FPR | Reference |
|---|---|---|---|
| Microsoft | 33% | 5.9% | (Kohavi, Crook and Longbotham 2009) |
| Avinash Kaushik | 20% | 11.1% | (Kaushik 2006) |
| Bing | 15% | 15.0% | (Kohavi, Deng, Longbotham, et al. 2014) |
| Booking.com, Google Ads, Netflix | 10% | 22.0% | (Manzi 2012; Thomke, Experimentation Works: The Surprising Power of Business Experiments, 2020; Moran 2007) |
| Airbnb Search | 8% | 26.4% | |
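For intuition on where these numbers come from, FPR can be estimated with Bayes' rule from the prior success rate, the significance level, and statistical power. The sketch below is a back-of-the-envelope calculation under assumptions of my own (a two-sided α of 0.05, i.e., 0.025 in the favorable direction, and 80% power); with those assumptions it reproduces the FPR column above.

```python
# Back-of-the-envelope FPR from the prior success rate. The alpha and power
# values are assumptions for illustration, not taken from the paper itself.
def false_positive_risk(success_rate, alpha_one_sided=0.025, power=0.8):
    """P(no real effect | statistically significant win), via Bayes' rule."""
    true_wins = power * success_rate                   # real effects we detect
    false_wins = alpha_one_sided * (1 - success_rate)  # nulls that look significant
    return false_wins / (true_wins + false_wins)

# Historical success rates from the table (Microsoft's 33% taken as 1/3).
for rate in [1 / 3, 0.20, 0.15, 0.10, 0.08]:
    print(f"success rate {rate:.0%} -> FPR {false_positive_risk(rate):.1%}")
```

The takeaway: the lower your organization's historical win rate, the more skeptical you should be of any single significant result.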
The examples we talked about in the last section have another name: KPI traps.
KPI traps happen when teams focus too much on hitting KPIs without considering their unintended consequences or overall business impact. This can lead to misaligned incentives, short-sighted decision-making, or even harm to the company in the long run.
Here are some practical ways to avoid KPI traps.
Think before execution. Take time to carefully evaluate your goals and strategy before diving into running any A/B test. Think through what problems you're really trying to solve and whether your chosen metrics truly align with those objectives.
Focus on today’s objectives, not historical ones. While historical data can provide context, stay focused on your current business objectives and user needs rather than being constrained by past approaches.
Remember that Netflix started in 1997 as a DVD rental service; Twitter began as a platform for finding and subscribing to podcasts; Play-Doh was originally intended to be used as a wallpaper cleaner. Imagine if these companies had stuck with the metrics they originally focused on.
Not everything is measurable. Even the most data-driven people have to admit that human behaviors are infinitely subtle and complex in ways that metrics could never adequately capture.
When Apple released the Magic Mouse 2, it came with a 'design flaw' that would have tanked every imaginable metric: in order to charge it, the mouse had to be turned upside down, rendering it temporarily unusable. But because of this "design flaw," people didn't use it plugged in like a regular mouse, and thus it truly felt magical.
Incorporate leading indicators to paint the whole picture. Besides lagging top-line metrics (e.g., revenue and profit), experimenters should also measure leading indicators that describe the user experience (e.g., user retention, page load time, and satisfaction scores). Neglecting these indicators can lead to significant business failures, and once teams realize something is wrong, it's often too late.
Myspace's failure is largely attributed to heavy monetization that resulted in a cluttered and slow user interface (and although they may have noticed that tradeoff, Myspace couldn't experiment without forfeiting revenue due to its advertising deal).
Use guardrails. Guardrail metrics monitor critical aspects of the product that could be negatively affected by the changes being tested. Tracking these metrics is important in order to catch unintended consequences, like performance regressions or drops in user retention.
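In practice, a guardrail check can be as simple as flagging a regression whenever the confidence interval for the change in a guardrail metric sits entirely below an acceptable threshold. Here's a minimal sketch (the metric, threshold, and function name are hypothetical illustrations, not any particular platform's API):

```python
import math

def guardrail_regression_detected(
    control_successes: int, control_n: int,
    treatment_successes: int, treatment_n: int,
    max_tolerable_drop: float = 0.005,  # tolerate at most a 0.5pp drop (assumed)
    z: float = 1.96,                    # 95% confidence interval
) -> bool:
    """Return True if the guardrail metric (e.g., 7-day retention) clearly regressed."""
    p_c = control_successes / control_n
    p_t = treatment_successes / treatment_n
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    upper = diff + z * se
    # Even the optimistic end of the interval is worse than the tolerable drop:
    # the guardrail fires and the experiment should not ship as-is.
    return upper < -max_tolerable_drop

# Example: retention drops from 40.0% to 38.8% with 100k users per group.
print(guardrail_regression_detected(40_000, 100_000, 38_800, 100_000))  # True
```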
The more, the merrier? Not necessarily. While having multiple metrics is important, tracking too many can lead to analysis paralysis, diluted focus, and difficulty making clear decisions. The multiple comparisons problem stands out as one of the most significant downsides: When analyzing a lot of metrics simultaneously, the likelihood of finding at least one false positive result increases substantially. Decisions based on false positives can lead to misguided actions, causing a successful experiment to fail after launch.
Statistical methods like the Bonferroni correction or the Benjamini-Hochberg procedure should be used to adjust significance thresholds when multiple metrics are tested (be aware that the Bonferroni correction can be quite conservative and may reduce the ability to detect true effects).
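To make the difference concrete, here's a minimal hand-rolled sketch of both corrections (the p-values are hypothetical; in practice you'd likely reach for a statistics library rather than rolling your own):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value compared against alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H0 for all p-values up to the largest rank i with p_(i) <= (i/m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[i] = True
    return reject

# Hypothetical p-values from five metrics in one experiment.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bonferroni(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

Bonferroni controls the chance of even one false positive and so rejects less; Benjamini-Hochberg controls the expected share of false positives among your "wins," which is usually a better fit for dashboards with many secondary metrics.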
Run holdout experiments. One of the most effective ways to validate the long-term impact of shipped experiments is a holdout: a small portion of users who are kept on the old experience even after the feature is officially launched.
This holdout group serves as a baseline for comparison against the new experience, allowing teams to track whether the experiment's benefits persist, fade, or even turn negative over time. Large companies like Disney and Uber have been using holdouts to uncover insights that might otherwise remain hidden.
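As a rough sketch of how a holdout can be carved out, the snippet below deterministically hashes users into buckets and keeps a small slice on the old experience (the 5% size, salt, and function names are my own illustration, not any specific platform's implementation):

```python
import hashlib

HOLDOUT_PERCENT = 5  # hypothetical: keep 5% of users on the old experience

def bucket(user_id: str, salt: str = "2024-global-holdout") -> int:
    """Deterministically map a user to one of 100 buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_holdout(user_id: str) -> bool:
    """Users in the first HOLDOUT_PERCENT buckets never see launched features."""
    return bucket(user_id) < HOLDOUT_PERCENT

def get_experience(user_id: str) -> str:
    return "control (holdout)" if in_holdout(user_id) else "launched feature"

print(get_experience("user_12345"))
```

Because assignment is a pure function of the user ID and a fixed salt, the same users stay in the holdout across sessions and launches, which is what makes long-term comparisons meaningful.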
Watch out for winner’s curse. Winner’s curse refers to a statistical phenomenon where the "winning" group of an experiment tends to have an overestimated effect size. False positive risk, as we mentioned above, can lead to the winner's curse by suggesting a non-existent effect and/or by exaggerating its magnitude. Statsig uses the method described in this paper to adjust for false positive risk when aggregating impacts from multiple experiments.
This adjustment won't catch KPI traps proactively, but it can provide a more realistic estimate of impact and help account for potential shortfalls after a feature launches.
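To see the winner's curse in action, here's a small simulation (my own illustration, not the adjustment method from the paper): every simulated experiment has the same modest true lift, yet the average measured lift among the "significant winners" comes out much larger.

```python
import random
import statistics

random.seed(42)

TRUE_LIFT = 0.5   # true effect, in arbitrary units (assumed for illustration)
NOISE_SD = 1.0    # standard error of each experiment's estimate
Z_CRIT = 1.96     # significance cutoff for the measured lift

# Each experiment reports true lift plus sampling noise.
measured = [random.gauss(TRUE_LIFT, NOISE_SD) for _ in range(100_000)]
winners = [m for m in measured if m / NOISE_SD > Z_CRIT]  # "significant" wins only

print(f"true lift:                 {TRUE_LIFT:.2f}")
print(f"avg lift across all tests: {statistics.mean(measured):.2f}")  # ~0.50
print(f"avg lift among 'winners':  {statistics.mean(winners):.2f}")   # well above 0.50
```

Conditioning on statistical significance selects the experiments whose noise happened to point upward, which is exactly why shipped "wins" tend to under-deliver.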
That being said, one of the best things I learned in my career is that failures can be turned into lessons. As the ancient Chinese proverb says: “Failure is the mother of success”.
This is why we suggest teams be careful with KPI traps but still encourage 'good' failures, because those failures push teams to innovate and explore new territory. It's also why a scalable experimentation system helps manage the risk, letting product builders iterate quickly and avoid major setbacks.
Ultimately, experimentation is a journey full of ups and downs, successes and failures. It’s important to set up your experiments properly and be careful with KPI traps.
Just remember that failed experiments aren't the end—they're stepping stones to innovation. By embracing both the wins and the setbacks, we can turn failures into learnings.