Ever run a bunch of experiments and felt swamped by all the significant results popping up? You're not alone! Experimentation is a fantastic way to learn and make data-driven decisions, but it comes with its own set of challenges—especially when you're testing multiple hypotheses at once.
In this blog, we'll dive into the tricky issue of multiple hypothesis testing and how it can lead to false positives. But don't worry! We'll also introduce you to the Benjamini-Hochberg correction, a handy method to keep those false discoveries in check. Let's get started!
Testing multiple hypotheses simultaneously might seem efficient, but it can actually increase the risk of false positives, or Type I errors. Basically, the more comparisons you make, the higher the chance that some will appear significant just by luck. This is known as the multiple testing problem.
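To see how fast this compounds, here's a quick back-of-the-envelope sketch. With a 5% significance level per test and every null hypothesis actually true, the chance of at least one false positive across m independent tests is 1 − (1 − 0.05)^m (the numbers below are illustrative, not from any particular experiment):

```python
# Chance of at least one false positive across m independent tests,
# when every null hypothesis is true and each test uses alpha = 0.05.
alpha = 0.05
for m in (1, 10, 100):
    p_any_false_positive = 1 - (1 - alpha) ** m
    print(f"{m:>3} tests: P(at least one false positive) = {p_any_false_positive:.2f}")
# → 0.05 for 1 test, about 0.40 for 10 tests, about 0.99 for 100 tests
```

Run 100 tests and a false positive is all but guaranteed, even though each individual test looks perfectly well-behaved.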
So, how do we tackle this issue? Two common approaches are controlling the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). FWER focuses on limiting the probability of making any Type I error, while FDR controls the expected proportion of false positives among all significant results.
If we ignore the multiple testing problem, we risk drawing invalid conclusions and making misguided decisions. Imagine thinking you've found a significant effect when it's actually just random chance—that could lead to implementing ineffective changes or missing out on real opportunities. That's why applying error control methods like the Bonferroni correction or the Benjamini-Hochberg (BH) procedure is so crucial.
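As a concrete sketch of what Bonferroni does (illustrative numbers, assuming independent tests): it simply judges each of m tests at alpha / m, which pins the family-wise error rate back down near alpha.

```python
# Bonferroni: judge each of m tests at alpha / m so the family-wise
# error rate stays at or below alpha (illustrative numbers).
alpha, m = 0.05, 100
per_test_alpha = alpha / m                     # 0.0005 per test
fwer_bound = 1 - (1 - per_test_alpha) ** m     # about 0.049, back under alpha
print(per_test_alpha, round(fwer_bound, 3))
```

The price of that guarantee is a very strict per-test bar, which is exactly the power loss that motivates FDR methods like Benjamini-Hochberg.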
Choosing between FWER and FDR depends on your experiment's goals and context. FWER is more stringent, suitable when any false positive is a big deal. FDR, on the other hand, allows for more false positives but offers greater statistical power—making it advantageous in exploratory analyses or when testing lots of hypotheses.
By implementing multiple testing corrections like the BH correction—methods that platforms like Statsig provide—you can make smarter decisions based on your experiments. Controlling the FDR helps balance the trade-off between false positives and statistical power, ensuring your conclusions are robust and actionable.
So, what's the Benjamini-Hochberg (BH) procedure all about? It's a handy method for controlling the False Discovery Rate (FDR) when you're juggling multiple hypotheses. It finds a sweet spot between discovery and error control, making it a great alternative to more conservative methods like the Bonferroni correction.
Here's how it works: you order your p-values from smallest to largest and compute a threshold for each rank based on your desired FDR, the rank of the p-value, and the total number of tests. Then you find the largest rank whose p-value falls at or below its threshold; that p-value, and every smaller one, is considered significant.
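A minimal sketch of the procedure in Python (the function name and example p-values here are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask: True where the null is rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # ranks p-values smallest to largest
    thresholds = np.arange(1, m + 1) / m * q   # (i/m) * Q for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()           # largest rank clearing its threshold
        reject[order[: k + 1]] = True          # reject all smaller p-values too
    return reject

# The "step-up" nature: ranks 1 and 2 miss their thresholds (0.005 and 0.010),
# but rank 3 clears 0.015, so all three smallest p-values get rejected.
p = [0.010, 0.013, 0.014, 0.190, 0.350, 0.500, 0.630, 0.670, 0.750, 0.810]
print(benjamini_hochberg(p, q=0.05).sum())  # → 3
```

Note the step-up behavior in the example: the two smallest p-values exceed their own thresholds, yet they're still rejected because a larger-ranked p-value clears its bar.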
Why is the BH correction so cool compared to traditional methods? Well:
It maintains higher statistical power, especially when dealing with lots of tests.
It allows for more discoveries while still controlling false positives.
It adapts to your data, becoming more stringent when there are fewer true positives and more lenient when there are many.
Because of these perks, the BH correction has been widely adopted in fields like computational biology, where multiple testing is the norm. Its ability to control the FDR makes it super valuable for exploratory analyses and hypothesis generation.
By incorporating the BH correction into their experimentation platforms, companies like Statsig empower users to make more informed decisions based on their data. It helps ensure you're focusing on the most promising findings while minimizing the risk of chasing false leads.
So, how does the BH correction play out in real-life experiments? Compared with stricter corrections like Bonferroni, it preserves more of your power to detect true effects in A/B tests. By controlling the FDR, you can make more discoveries while keeping false positives in check. This balance is crucial: it means you can maximize findings without compromising your results' integrity.
Here are some real-world examples where the BH correction shines:
Computational biology: It's used to identify differentially expressed genes while controlling for multiple testing (source).
Online experiments: Companies like Microsoft apply the BH correction to evaluate how website changes impact user behavior (source).
Baseball statistics: It helps determine which players have a significantly higher batting average (source).
When considering the BH correction, think about your dataset's specifics and your research goals. It's particularly advantageous when dealing with a large number of hypotheses, as it maintains statistical power while controlling the FDR. However, if strict control of Type I errors is crucial—say, in high-stakes decisions—you might prefer the more conservative Bonferroni correction.
Ultimately, the choice between the BH correction and other methods depends on balancing Type I and Type II errors. By understanding each approach's strengths and limitations, you can make informed decisions and optimize your experimental design for maximum discovery and reliability.
Choosing the right method: When deciding whether to use the BH correction, consider factors like the number of hypotheses, your desired balance between power and error control, and the nature of your study. BH is particularly suitable for large-scale studies and exploratory analyses.
Putting BH into action: Here's how you can apply the BH procedure:
Order your p-values from smallest to largest.
Calculate the BH threshold for each p-value: (i/m)Q, where i is the rank, m is the total number of tests, and Q is the desired FDR.
Compare each p-value to its corresponding threshold.
Find the largest rank k whose p-value falls at or below its threshold, and reject the null hypotheses for the k smallest p-values (even if some of them individually exceed their own thresholds).
Following these steps ensures accurate implementation of the BH correction in your analysis.
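The steps above can be walked through with a handful of hypothetical p-values (the values below are invented for illustration):

```python
# Step-by-step BH at FDR level Q = 0.10 over m = 5 hypothetical p-values.
p_sorted = [0.004, 0.009, 0.026, 0.041, 0.200]   # step 1: ordered smallest to largest
Q = 0.10
m = len(p_sorted)

last_rejected_rank = 0
for i, p in enumerate(p_sorted, start=1):
    threshold = (i / m) * Q                       # step 2: BH threshold (i/m)Q
    if p <= threshold:                            # step 3: compare
        last_rejected_rank = i                    # track the largest qualifying rank

# Step 4: reject every null hypothesis up to that rank.
print(f"reject the {last_rejected_rank} smallest p-values")  # → reject the 4 smallest p-values
```

Here the thresholds are 0.02, 0.04, 0.06, 0.08, and 0.10; the first four p-values clear their bars, so four nulls are rejected and only the fifth survives.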
Steering clear of common pitfalls: Be mindful of the assumptions underlying the BH correction, like the independence of tests. If your tests are correlated, consider alternative methods like the Benjamini-Yekutieli procedure. Also, clearly define your hypotheses and use appropriate statistical tests to ensure valid p-values for the BH procedure.
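For correlated tests, the Benjamini-Yekutieli adjustment simply shrinks the BH thresholds by the harmonic factor c(m) = 1 + 1/2 + … + 1/m. A rough sketch (the function name is made up for illustration):

```python
def benjamini_yekutieli_thresholds(m, q=0.05):
    """BH thresholds (i/m) * q divided by c(m) = sum_{j=1..m} 1/j,
    which preserves FDR control under arbitrary dependence between tests."""
    c_m = sum(1.0 / j for j in range(1, m + 1))
    return [(i / m) * q / c_m for i in range(1, m + 1)]

# With m = 4 tests, c(m) ≈ 2.083, so every threshold is roughly halved.
print([round(t, 4) for t in benjamini_yekutieli_thresholds(4)])  # → [0.006, 0.012, 0.018, 0.024]
```

The stricter thresholds cost some power, which is why BY is usually reserved for cases where the dependence structure is unknown or clearly adversarial.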
By keeping these guidelines in mind, you can effectively implement the BH correction in your research. This helps you balance discovering true positives with controlling false positives. So go ahead—embrace the power of online experiments and make data-driven decisions while carefully managing error rates.
Navigating the challenges of multiple hypothesis testing doesn't have to be daunting. The Benjamini-Hochberg correction offers a practical way to control the False Discovery Rate, helping you uncover meaningful insights without getting bogged down by false positives. Tools like Statsig incorporate these methods to make your experimentation smoother and more reliable.
If you're eager to learn more, check out the links we've sprinkled throughout this post—they're packed with valuable information. And remember, the key is to find the right balance between discovery and error control to make informed, data-driven decisions.