Non-inferiority: Proving features aren't worse

Mon Jun 23 2025

Picture this: you've spent months building a sleeker checkout flow that's way easier to use, but when you run the A/B test, conversion drops by 0.5%. Do you scrap all that work because it's not "better"?

This is where most teams get stuck - they're using the wrong tool for the job. Traditional A/B tests are designed to prove one thing beats another, but sometimes you just need to know your changes won't tank your metrics while you chase other benefits like better UX or lower costs.

Understanding non-inferiority tests in experimentation

Non-inferiority tests flip the script on traditional experimentation. Instead of asking "is this better?", they ask "is this good enough?"

Think about it - not every change needs to boost your metrics. Maybe you're switching to a cheaper infrastructure provider, redesigning for accessibility compliance, or simplifying code that's become a maintenance nightmare. In these cases, you'd probably accept a small performance hit for the other benefits you're getting.

The magic happens when you define your non-inferiority margin - basically, how much worse you're willing to let things get. Say your current checkout converts at 3%. You might decide that anything above 2.9% is acceptable if it means cutting page load time in half. That 0.1 percentage-point difference becomes your margin.

Here's what makes this approach so powerful for product teams:

  • You can finally justify those quality-of-life improvements

  • Legal and compliance changes become less scary

  • Technical debt paydown gets easier to prioritize

  • Design updates don't need to show immediate metric lifts

The team at Stitch Fix discovered this when updating their recommendation algorithms. As they explain in their engineering blog, they often needed to prove new models weren't worse than existing ones before considering other factors like computational efficiency or interpretability.

Designing effective non-inferiority experiments

Setting up these tests isn't rocket science, but you need to nail three things: your margin, your sample size, and your success criteria.

First up - picking that non-inferiority margin. This is where teams usually mess up. They either go too conservative (0.01% difference) and need millions of users, or too liberal (5% difference) and risk actual damage. The sweet spot? Ask yourself: "What's the biggest drop in this metric I could explain to my boss without sweating?"

Your hypothesis setup looks different too. Instead of the usual "variant beats control," you're testing "variant isn't worse than control by more than X%." It sounds like a small change, but it fundamentally shifts how you calculate sample sizes and interpret results.
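
To make that shift concrete, here's a minimal sketch of the comparison as a one-sided z-test on two conversion rates. The counts, the 0.1 percentage-point margin, and the helper name are all made up for illustration - your experimentation platform will handle this math for you.

```python
from scipy.stats import norm

def noninferiority_z_test(conv_c, n_c, conv_v, n_v, margin, alpha=0.05):
    """One-sided test of H0: p_variant <= p_control - margin
    against   H1: p_variant >  p_control - margin."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    se = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5
    z = (diff + margin) / se          # how far the observed diff sits above -margin
    p_value = 1 - norm.cdf(z)         # one-sided p-value
    return diff, p_value, p_value < alpha

# Hypothetical checkout numbers: ~3% baseline, margin of 0.1 percentage points (0.001)
diff, p, ok = noninferiority_z_test(
    conv_c=15_000, n_c=500_000, conv_v=14_900, n_v=500_000, margin=0.001)
print(f"observed diff: {diff:+.4%}, one-sided p: {p:.4f}, non-inferior: {ok}")
```

Notice that the null hypothesis is "the variant is worse by more than the margin" - rejecting it is what lets you claim non-inferiority.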

Speaking of sample sizes - brace yourself, because you'll need more users than a standard test. The team at NEPHJC found that non-inferiority tests often require 20-30% larger samples, because you're trying to rule out differences smaller than the lift a standard test would look for. This makes sense when you think about it - proving something isn't worse requires more precision than proving it's wildly better.
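
For a rough sense of scale before you reach for a proper calculator, the usual normal-approximation formula for two proportions looks like the sketch below. It assumes both arms sit near the baseline rate and that the variant truly performs about the same as control - a ballpark estimate, not a substitute for your platform's power analysis.

```python
from scipy.stats import norm

def noninferiority_sample_size(p_baseline, margin, alpha=0.05, power=0.8,
                               true_diff=0.0):
    """Approximate users needed per group for a non-inferiority test of two
    proportions (one-sided alpha). `true_diff` is the assumed real difference
    (variant minus control); 0.0 means they genuinely perform the same."""
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    variance = 2 * p_baseline * (1 - p_baseline)   # both arms near the baseline rate
    effect = margin + true_diff                    # distance from the null boundary
    n = variance * (z_alpha + z_beta) ** 2 / effect ** 2
    return int(round(n))

# Hypothetical checkout example: 3% baseline, 0.1 percentage-point margin
print(noninferiority_sample_size(p_baseline=0.03, margin=0.001))
```

Plugging in that 3% baseline and 0.1 percentage-point margin gives on the order of 360,000 users per arm at 80% power - a concrete reminder that tight margins get expensive fast.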

Here's a practical checklist for your next non-inferiority test:

  • Define your margin based on business impact, not statistical convenience

  • Calculate sample size using non-inferiority formulas (not standard A/B calculators)

  • Choose metrics that directly measure what you care about

  • Document why non-inferiority makes sense for this specific change

  • Get stakeholder buy-in on the margin before you start

Interpreting and analyzing non-inferiority test results

Results are in, and now comes the tricky part - figuring out what they actually mean.

The key is understanding confidence intervals in this new context. In a regular A/B test, you want the confidence interval for the lift to exclude zero and sit entirely above it. For non-inferiority, you want that interval to sit entirely above the negative of your margin. So if your margin is 0.1% and your confidence interval is [-0.05%, +0.15%], congratulations - you've shown non-inferiority.
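
If you'd rather reason in intervals than p-values, the check is just: does the lower bound of the CI for (variant minus control) clear the negative margin? A rough sketch, again with invented counts:

```python
from scipy.stats import norm

def diff_confidence_interval(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-sided (1 - alpha) CI for the difference in conversion rates
    (variant minus control), using the normal approximation."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    se = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    return diff - z * se, diff + z * se

margin = 0.001                                   # 0.1 percentage points
low, high = diff_confidence_interval(15_000, 500_000, 14_950, 500_000)
print(f"CI for the difference: [{low:+.4%}, {high:+.4%}]")
print("non-inferior" if low > -margin else "inconclusive (or worse)")
```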

But watch out for these common traps:

The biocreep problem is real and insidious. Say you prove version 2 is non-inferior to version 1 with a 1% margin. Then version 3 is non-inferior to version 2. Before you know it, version 10 could be close to 10% worse than your original, even though each individual step seemed fine. Always compare back to a stable baseline, not just the most recent version.
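
A quick way to see the drift: suppose each release really loses 0.8% of conversions relative to the release right before it - comfortably inside a 1% per-step margin - yet the losses compound against the original baseline. The numbers below are purely illustrative.

```python
# Toy illustration of biocreep: every step passes its own non-inferiority
# check against the previous version, but the gap to v1 keeps growing.
baseline = 0.0300          # original conversion rate (v1)
true_step_drop = 0.008     # each version is really 0.8% (relative) worse than the last

rate = baseline
for version in range(2, 11):
    rate *= (1 - true_step_drop)
    vs_baseline = 1 - rate / baseline
    print(f"v{version}: {true_step_drop:.1%} worse than v{version - 1} "
          f"(within a 1% margin), but {vs_baseline:.1%} worse than v1")
```

By version 10, the cumulative drop is around 7% - every comparison looked fine, and the product quietly got much worse.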

Non-inferiority doesn't mean equivalent. Your new version might actually be worse - just not worse enough to matter. Reddit's statistics community regularly points out this confusion when teams claim their changes have "no impact" based on non-inferiority tests alone.

When presenting results, be crystal clear about what you've proven. Say "the new design performs within 0.5% of the original" not "the new design performs the same." Your stakeholders need to understand they're making a trade-off, even if it's a good one.

Applying non-inferiority testing in product development

Let's get practical - where does this actually fit in your development process?

UI updates are the obvious starting point. When your design team creates a cleaner interface, you can use non-inferiority tests to ensure engagement doesn't tank while you improve usability. Infrastructure changes are another perfect use case - proving your migration to a new database doesn't hurt performance lets you capture cost savings confidently.

Integrating these tests into your workflow doesn't require overhauling everything. Start by identifying changes where "not making things worse" is the real success criterion:

  • Performance optimizations that trade small metric drops for big speed gains

  • Accessibility improvements required for compliance

  • Technical migrations and refactoring projects

  • Simplifying complex features for better user experience

The key is setting expectations early. When planning these projects, explicitly state you're targeting non-inferiority, not improvement. This prevents the awkward conversation later when metrics don't move up and to the right.

Statsig makes this particularly straightforward by supporting non-inferiority tests natively in their platform. You can define your margins upfront and get clear readouts on whether you've achieved non-inferiority - no manual statistics required.

Remember to revisit your margins periodically. As your product matures and metrics stabilize, you might tighten margins to maintain quality. Or as you enter new markets, you might loosen them to move faster while learning.

Closing thoughts

Non-inferiority testing is one of those tools that seems niche until you need it - then you wonder how you lived without it. It's not about lowering your standards; it's about having the right standards for different situations.

The next time someone proposes a change that makes your product better in ways that don't show up in conversion rates, don't dismiss it. Run a non-inferiority test and make a real trade-off decision based on data.

Want to dive deeper? Check out Statsig's guide on understanding non-inferiority tests or the excellent technical breakdown from Stitch Fix's engineering team.

Hope you find this useful! And remember - sometimes "good enough" really is good enough, especially when it comes with other benefits your users actually care about.


