Bonferroni Correction

What is Bonferroni correction?

The Bonferroni correction is a statistical method used to adjust the significance level when conducting multiple hypothesis tests simultaneously. It aims to control the family-wise error rate (FWER), which is the probability of making at least one Type I error (false positive) across all tests.

In simple terms, the Bonferroni correction helps maintain the overall significance level by dividing the desired significance level (α) by the number of tests performed. For example, if you have 10 hypothesis tests and want to maintain an overall significance level of 0.05, the Bonferroni correction would set the significance level for each individual test to 0.005 (0.05 / 10).

The importance of the Bonferroni correction lies in its ability to reduce the chances of obtaining false positives when conducting multiple comparisons. By controlling the FWER, researchers can have more confidence in the significant results they find, knowing that the probability of making at least one Type I error is limited to the desired level.

However, it's worth noting that the Bonferroni correction is a conservative approach and may lead to reduced statistical power, especially when the number of tests is large. This means that it may be more difficult to detect true significant differences, increasing the risk of Type II errors (false negatives).

Despite its limitations, the Bonferroni correction remains a widely used and straightforward method for addressing the multiple comparisons problem in various fields, including A/B testing, genomics, and social sciences.

The multiple comparisons problem

When running multiple hypothesis tests, the probability of observing at least one significant result by chance alone increases. This is known as the multiple comparisons problem.

The family-wise error rate (FWER) is the probability of making at least one Type I error (false positive) among all hypotheses tested. For independent tests, it can be calculated as: FWER = 1 - (1 - α)^m, where α is the significance level for each individual test and m is the number of hypotheses.

For example, consider an A/B test with a control and two variants, each evaluated on five metrics. With a significance level of 0.05, the FWER is 1 - (1 - 0.05)^(2*5) ≈ 0.401, meaning there's a 40.1% chance of observing at least one false positive.
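
As a quick check, here's a minimal Python sketch that reproduces this calculation, assuming the ten tests are independent:

```python
# FWER for m independent hypothesis tests, each run at significance level alpha
alpha = 0.05
num_variants = 2
num_metrics = 5
m = num_variants * num_metrics  # 10 tests in total

fwer = 1 - (1 - alpha) ** m
print(f"FWER across {m} tests: {fwer:.3f}")  # ~0.401
```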

Bonferroni correction: Controlling the FWER

The Bonferroni correction is a simple method to control the FWER. It adjusts the significance level for each individual test by dividing it by the number of hypotheses: α_corrected = α / m.

Applying the Bonferroni correction to the previous example, the adjusted significance level would be 0.05 / (2*5) = 0.005. Each individual test is evaluated at a more stringent threshold to maintain the desired FWER.
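
Continuing the same example, a short sketch of the adjustment and its effect on the family-wise error rate (again assuming independent tests):

```python
alpha = 0.05
m = 10  # 2 variants x 5 metrics

alpha_corrected = alpha / m  # 0.005 per individual test
fwer_corrected = 1 - (1 - alpha_corrected) ** m

print(f"Per-test threshold: {alpha_corrected}")        # 0.005
print(f"FWER after correction: {fwer_corrected:.3f}")  # ~0.049, below the 0.05 target
```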

However, the Bonferroni correction can be conservative, especially with a large number of hypotheses. It may lead to reduced statistical power and increased Type II errors (false negatives).

Preferential Bonferroni correction: Prioritizing the primary metric

The preferential Bonferroni correction is a modification that gives additional weight to the primary metric. It allocates a larger portion of the overall significance level to the primary metric, while the remaining level is divided among secondary metrics.

Mathematically, the preferential Bonferroni correction uses (1 - γα/k) confidence intervals for the primary metric and (1 - (1 - γ)α/(k(m - 1))) confidence intervals for each secondary metric, where γ is the weighted allocation factor (the share of α reserved for the primary metric), k is the number of variants, and m is the total number of metrics.

This approach maintains higher power for detecting changes in the primary metric, even as the number of secondary metrics increases. It strikes a balance between controlling the FWER and prioritizing the most important hypothesis.
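
To make the allocation concrete, here's an illustrative sketch of how the error budget might be split. The value of γ is an assumption for the example, not a prescribed default:

```python
# Split the overall alpha between one primary metric and (m - 1) secondary metrics,
# with k variant-vs-control comparisons per metric.
alpha = 0.05
gamma = 0.8  # share of alpha reserved for the primary metric (assumed for illustration)
k = 2        # number of variants compared against control
m = 5        # total metrics: 1 primary + 4 secondary

alpha_primary = gamma * alpha / k                      # per test on the primary metric
alpha_secondary = (1 - gamma) * alpha / (k * (m - 1))  # per test on each secondary metric

# The budget still sums to alpha: k*alpha_primary + k*(m-1)*alpha_secondary == alpha
print(alpha_primary, alpha_secondary)  # 0.02 and 0.00125
```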

How Bonferroni correction works

The Bonferroni correction adjusts the significance level for each individual hypothesis test. It divides the desired overall significance level (α) by the number of hypotheses (m). This gives a new, more stringent significance level for each individual test: α/m.

To apply the Bonferroni correction, multiply each individual p-value by the number of hypotheses tested (capping the result at 1). If the resulting adjusted p-value is still below the desired significance level, the result is considered statistically significant. Equivalently, the correction widens the confidence intervals for each individual hypothesis test.
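
For example, a minimal sketch of Bonferroni-adjusted p-values (the raw p-values here are hypothetical):

```python
alpha = 0.05
p_values = [0.003, 0.020, 0.047, 0.310]  # hypothetical raw p-values
m = len(p_values)

# Multiply each p-value by m, cap at 1.0, then compare against the original alpha
adjusted = [min(p * m, 1.0) for p in p_values]
significant = [p_adj < alpha for p_adj in adjusted]

print(adjusted)     # [0.012, 0.08, 0.188, 1.0]
print(significant)  # [True, False, False, False]
```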

The Bonferroni correction is simple to apply and understand. It controls the family-wise error rate (FWER), ensuring a low probability of even one false positive. However, it can be overly conservative, especially with many hypotheses, leading to reduced statistical power and increased false negatives.

Applying Bonferroni correction in experimentation

To implement the Bonferroni correction in your experiments, follow these steps:

  1. Determine the desired family-wise error rate (FWER) for your experiment. This is the probability of making at least one Type I error (false positive) across all hypothesis tests.

  2. Divide the desired FWER by the total number of hypothesis tests in your experiment (for example, the number of metrics multiplied by the number of variants). This gives you the adjusted significance level for each individual test.

  3. Conduct your hypothesis tests using the adjusted significance level. If any test results in a p-value lower than the adjusted level, consider it statistically significant.
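
Putting these steps together, here's a minimal sketch of a helper (the function name and p-values are hypothetical) that applies the correction to a set of test results:

```python
def bonferroni_decisions(p_values, fwer=0.05):
    """Return (adjusted_alpha, significance flags) for a set of hypothesis tests."""
    m = len(p_values)
    adjusted_alpha = fwer / m  # step 2: per-test significance level
    return adjusted_alpha, [p < adjusted_alpha for p in p_values]  # step 3

# Example: 2 variants x 5 metrics = 10 p-values (hypothetical)
p_values = [0.004, 0.03, 0.12, 0.45, 0.002, 0.07, 0.51, 0.009, 0.33, 0.06]
alpha_per_test, flags = bonferroni_decisions(p_values, fwer=0.05)
print(alpha_per_test)  # 0.005
print(flags)           # only the p-values below 0.005 are flagged
```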

When applying the Bonferroni correction, it's essential to consider the distinction between primary and secondary metrics. Primary metrics directly measure the experiment's main objective, while secondary metrics provide additional insights. To maintain statistical power for the primary metric, you can allocate a larger portion of the FWER to it, using the preferential Bonferroni method. This approach ensures that the power to detect changes in the primary metric doesn't depend on the number of secondary metrics.

Applying the Bonferroni correction impacts statistical power and sample size requirements. As the number of hypothesis tests increases, the adjusted significance level becomes more stringent, reducing the power to detect true differences. To maintain the desired power, you'll need to increase the sample size. Use power analysis tools to determine the required sample size based on the number of metrics, variants, and the expected effect size.
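
For illustration, here's a sketch using the power calculator in statsmodels (assuming it's installed); the effect size and power target are placeholder assumptions:

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.2  # assumed standardized effect size (Cohen's d)
power = 0.8        # assumed power target
m = 10             # number of hypothesis tests

n_uncorrected = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=power)
n_corrected = analysis.solve_power(effect_size=effect_size, alpha=0.05 / m, power=power)

print(math.ceil(n_uncorrected))  # roughly 394 per group at alpha = 0.05
print(math.ceil(n_corrected))    # roughly 670 per group at alpha = 0.005 (about 1.7x more data)
```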

Remember, the Bonferroni correction is a conservative approach that controls the FWER. It may lead to reduced power and increased sample size requirements compared to other multiple testing correction methods. Consider the trade-offs between Type I and Type II errors when deciding on the appropriate correction method for your experiments.

Alternatives to Bonferroni correction

While the Bonferroni correction is a simple and conservative approach to address multiple comparisons, other methods offer more power and precision. The Holm-Bonferroni method is a step-down procedure that applies a less stringent correction to p-values, reducing the risk of false negatives. It sequentially compares p-values to incrementally larger thresholds, potentially identifying more significant results than the Bonferroni correction.
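
A minimal sketch of the Holm step-down procedure (the helper and p-values are hypothetical): p-values are sorted in ascending order and compared against thresholds that grow from α/m up to α.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure: compare sorted p-values to alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    reject = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)  # thresholds get larger as rank increases
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # stop at the first non-significant p-value
    return reject

p_values = [0.012, 0.003, 0.04, 0.02]  # hypothetical raw p-values
print(holm_bonferroni(p_values))
# [True, True, True, True]; plain Bonferroni (threshold 0.05/4 = 0.0125)
# would reject only the 0.003 and 0.012 results
```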

Another alternative is the False Discovery Rate (FDR) method, which controls the expected proportion of false positives among all significant results. The FDR method is less conservative than the Bonferroni correction and is particularly useful when dealing with a large number of tests. It allows for a higher rate of false positives but provides a better balance between Type I and Type II errors.
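
Here's a short sketch comparing the two using the Benjamini-Hochberg procedure in statsmodels (assuming it's installed; the p-values are hypothetical):

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.014, 0.03, 0.04, 0.2]  # hypothetical raw p-values

# multipletests returns (reject flags, adjusted p-values, Sidak alpha, Bonferroni alpha)
reject_bonf, p_adj_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
reject_fdr, p_adj_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

print(reject_bonf.tolist())  # [True, False, False, False, False, False]
print(reject_fdr.tolist())   # [True, True, True, True, True, False]
```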

Choosing the appropriate multiple comparison correction method depends on the specific goals and constraints of your analysis. If controlling the family-wise error rate is crucial and you prefer a conservative approach, the Bonferroni correction or Holm-Bonferroni method may be suitable. However, if you're willing to tolerate a higher false positive rate in exchange for increased power, the FDR method could be a better choice.

It's important to consider the trade-offs between power and false positive control when selecting a correction method. In scenarios where the cost of false positives is high, such as medical research or safety-critical applications, a more conservative approach like the Bonferroni correction may be preferred. On the other hand, in exploratory analyses or situations where false negatives are more problematic, methods like FDR can provide a better balance.

Ultimately, the choice of multiple comparison correction method should align with your research objectives and the specific characteristics of your data. By understanding the strengths and limitations of each approach, you can make an informed decision that maximizes the validity and interpretability of your results while controlling for the multiple comparisons problem.
