Bonferroni Multiple Comparison

In the world of experimentation, making decisions based on data is crucial. But what happens when you're testing multiple hypotheses simultaneously? This is where the Bonferroni correction comes into play, helping you maintain the integrity of your results.

The Bonferroni correction is a statistical method used to adjust the significance level when conducting multiple hypothesis tests. It's designed to reduce the chances of making a Type I error (false positive) across all comparisons.

Why is this important? Imagine you're running an experiment with 20 metrics, each tested at a 5% significance level. If no metric has a true effect and the tests are independent, the probability of getting at least one false positive is 1 - 0.95^20, or about 64%.
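The 64% figure can be reproduced with a few lines of Python. This is a minimal sketch that assumes the tests are independent and that every null hypothesis is true; the function name is illustrative.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests
    when no metric has a real effect."""
    return 1.0 - (1.0 - alpha) ** num_tests


# The uncorrected FWER grows quickly with the number of metrics.
for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: FWER = {familywise_error_rate(0.05, m):.3f}")
```

With 20 tests the function returns roughly 0.64, matching the example above.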

The Bonferroni multiple comparison method addresses this by dividing the desired significance level (e.g., 0.05) by the number of tests performed. This ensures that the family-wise error rate (FWER) - the probability of making at least one Type I error - is controlled.

Here's how it works:

  • Adjusted significance level (α) = Original α / Number of tests

  • If you have 20 metrics and want an overall α of 0.05, each test's significance level would be 0.05/20 = 0.0025
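As a small sketch of the formula above, the per-test threshold is a single division. The function name is hypothetical and not part of any particular library.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


threshold = bonferroni_threshold(0.05, 20)
print(f"Per-test significance level: {threshold:.4f}")
```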

By setting a more stringent threshold for statistical significance, the Bonferroni correction helps prevent false positives when testing multiple hypotheses. It's a conservative approach that prioritizes avoiding false positives over statistical power, making it suitable for situations where Type I errors are particularly costly.

However, it's important to note that the Bonferroni correction can be overly conservative, especially with a large number of tests. This may lead to reduced statistical power and increased Type II errors (false negatives).

Despite its limitations, the Bonferroni multiple comparison method remains a widely used and easily understood approach to controlling the FWER. It's particularly useful when the number of tests is relatively small, or when you want to be extra cautious about false positives.

The mathematics behind Bonferroni correction

In its usual form, the Bonferroni correction divides the desired significance level (α) by the number of tests (m), which yields the per-test threshold α/m. P-values below this threshold are considered significant. Equivalently, each p-value can be multiplied by m (capped at 1) and compared with the original α.

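Bonferroni correction controls the family-wise error rate (FWER), the probability of making at least one Type I error among all hypotheses tested. The guarantee follows from a simple bound: the probability of at least one false positive is at most the sum of the per-test false positive probabilities, m × (α/m) = α. By setting a more stringent significance threshold, it ensures that the FWER remains at or below the desired level (e.g., 0.05).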
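One way to see FWER control in action is a quick simulation. The sketch below assumes every null hypothesis is true and the tests are independent, so each p-value is a uniform random number. The function name, seed, and experiment count are illustrative choices.

```python
import random


def simulate_fwer(num_tests, per_test_alpha, num_experiments=20_000, seed=0):
    """Fraction of simulated experiments with at least one false positive.

    Every null hypothesis is true, so each p-value is uniform on [0, 1).
    """
    rng = random.Random(seed)
    experiments_with_error = 0
    for _ in range(num_experiments):
        # A false positive occurs whenever a null p-value falls below the threshold.
        if any(rng.random() < per_test_alpha for _ in range(num_tests)):
            experiments_with_error += 1
    return experiments_with_error / num_experiments


ALPHA, M = 0.05, 20
uncorrected = simulate_fwer(M, ALPHA)
corrected = simulate_fwer(M, ALPHA / M)
print(f"Uncorrected FWER: {uncorrected:.3f}")
print(f"Bonferroni FWER:  {corrected:.3f}")
```

The uncorrected rate should land far above 0.05, while the Bonferroni rate should land close to it without going meaningfully over.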

As the number of tests increases, the Bonferroni-corrected p-value threshold becomes more conservative. For example, with 100 tests and α=0.05, the adjusted threshold is 0.0005. This strict threshold reduces false positives but may increase false negatives.

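Despite its simplicity, Bonferroni correction is often too conservative, especially with many tests. It does not require the tests to be independent, so the FWER guarantee holds under any dependence structure, but it also does not take advantage of correlation between metrics. When metrics are strongly positively correlated, as is common in practice, the actual FWER falls well below α. The result is overcorrection and loss of statistical power.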

Bonferroni correction is most appropriate when the number of tests is relatively small and the cost of a false positive is high. In other scenarios, less conservative methods like the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR) may be preferred.

When applying Bonferroni correction in tools like Statsig, be mindful of the number of metrics and variants in your experiment. Removing unnecessary metrics or combining similar variants can help maintain statistical power while controlling the FWER.

Applying Bonferroni correction in experimentation

To apply the Bonferroni correction in your experiments, follow these steps:

  1. Determine the desired family-wise error rate (FWER) for your experiment. This is the probability of making at least one type I error (false positive) across all hypotheses tested.

  2. Divide the FWER by the number of hypotheses (metrics or variants) to obtain the adjusted significance level. For example, if you have an FWER of 0.05 and are testing 10 metrics, the adjusted significance level would be 0.005 (0.05 / 10).

  3. Use the adjusted significance level when evaluating each hypothesis. If the p-value for a metric is less than the adjusted significance level, consider it statistically significant.
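The three steps above can be sketched as a small helper. The metric names and p-values here are invented for illustration, and the function is a hypothetical example rather than any tool's API.

```python
def bonferroni_decisions(p_values: dict[str, float], fwer: float = 0.05) -> dict[str, bool]:
    """Flag each metric as significant or not at the Bonferroni-adjusted level."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    threshold = fwer / len(p_values)  # step 2: divide the FWER by the number of hypotheses
    return {name: p < threshold for name, p in p_values.items()}  # step 3


# Hypothetical p-values for 10 metrics (illustrative only).
raw = [0.001, 0.004, 0.012, 0.03, 0.049, 0.2, 0.35, 0.5, 0.71, 0.9]
p_values = {f"metric_{i}": p for i, p in enumerate(raw, start=1)}

decisions = bonferroni_decisions(p_values, fwer=0.05)
significant = [name for name, is_sig in decisions.items() if is_sig]
print("Significant after correction:", significant)
```

In this made-up data, five metrics fall below 0.05 on their own, but only the two with p-values under 0.005 survive the correction.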

When applying the Bonferroni correction, it's important to consider the distinction between primary and secondary metrics. The primary metric is the most important measure of success for your experiment, while secondary metrics provide additional insights. To maintain statistical power for the primary metric, you can allocate a larger portion of the FWER to it, leaving the remaining portion to be divided among the secondary metrics.
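A minimal sketch of this weighted allocation follows. The 50% share given to the primary metric is an arbitrary assumption chosen for illustration; any split works as long as the per-metric levels add up to the overall FWER.

```python
def allocate_alpha(fwer: float, primary_share: float, num_secondary: int) -> tuple[float, float]:
    """Split the FWER between one primary metric and several secondary metrics.

    Returns (alpha for the primary metric, alpha for each secondary metric).
    """
    if not 0.0 < primary_share <= 1.0:
        raise ValueError("primary_share must be in (0, 1]")
    primary_alpha = fwer * primary_share
    if num_secondary == 0:
        return primary_alpha, 0.0
    # The remaining budget is divided evenly among the secondary metrics.
    secondary_alpha = fwer * (1.0 - primary_share) / num_secondary
    return primary_alpha, secondary_alpha


primary_alpha, secondary_alpha = allocate_alpha(fwer=0.05, primary_share=0.5, num_secondary=9)
print(f"Primary metric alpha:   {primary_alpha:.4f}")
print(f"Each secondary metric:  {secondary_alpha:.5f}")
```

Compared with an even split across 10 metrics (0.005 each), the primary metric gets a looser threshold of 0.025 at the cost of stricter thresholds for the secondary metrics.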

Keep in mind that the Bonferroni correction is a conservative approach and can lead to an increased risk of false negatives (type II errors). By setting a more stringent significance level, you may fail to detect some true effects. This is a trade-off between controlling the FWER and maintaining statistical power.

To mitigate the impact on statistical power, consider the following strategies:

  • Limit the number of metrics and variants in your experiment to only those that are most relevant and actionable.

  • Use a hierarchical approach, where you first test the primary metric and only proceed to test secondary metrics if the primary metric is significant.

  • Employ alternative correction methods, such as the Holm-Bonferroni or Benjamini-Hochberg procedures, which are less conservative and maintain better statistical power while still controlling the FWER or false discovery rate (FDR).
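To make the last strategy concrete, here is a sketch of the Holm-Bonferroni and Benjamini-Hochberg procedures written from their standard textbook definitions, using only the standard library. The p-values are invented for illustration. In practice you would likely rely on a maintained statistics library instead.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure (controls the FWER). Flags are in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first hypothesis that is not rejected
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up Benjamini-Hochberg procedure (controls the FDR). Flags are in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank whose p-value is at most (rank / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


# Hypothetical p-values (illustrative only).
p_values = [0.001, 0.011, 0.02, 0.03, 0.2]
alpha = 0.05

bonferroni = [p <= alpha / len(p_values) for p in p_values]
holm = holm_bonferroni(p_values, alpha)
bh = benjamini_hochberg(p_values, alpha)

print("Bonferroni rejections:", sum(bonferroni))
print("Holm rejections:      ", sum(holm))
print("BH rejections:        ", sum(bh))
```

On this example, plain Bonferroni rejects one hypothesis, Holm rejects two, and Benjamini-Hochberg rejects four. Note that Benjamini-Hochberg offers a different guarantee (FDR rather than FWER), so the extra rejections come with a higher tolerance for false positives.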

By carefully applying the Bonferroni correction and considering its implications, you can ensure the reliability of your experimental results and make informed decisions based on multiple comparisons.

The Bonferroni correction is a simple and widely applicable approach for multiple comparison procedures. It can be easily applied to various types of statistical tests and experimental designs. This simplicity makes it a popular choice for researchers and analysts.

However, the Bonferroni method is known for being conservative, especially when dealing with a large number of comparisons. This conservative nature can lead to an increased risk of Type II errors (false negatives), potentially failing to detect true differences between groups or treatments. In some cases, this may result in overlooking important findings.

When compared to other multiple comparison procedures, such as the Holm-Bonferroni method or the Benjamini-Hochberg procedure, the Bonferroni correction tends to have less statistical power. The Holm-Bonferroni method controls the family-wise error rate just as Bonferroni does while being at least as powerful. The Benjamini-Hochberg procedure controls the false discovery rate instead, a weaker guarantee that buys more power to detect significant differences. Either may be preferred in situations where a more powerful test is required.

Despite its limitations, the Bonferroni multiple comparison method remains a valuable tool in the researcher's arsenal. Its simplicity and wide applicability make it a go-to choice for many experimental designs. However, it is essential to consider the specific requirements of your study and weigh the trade-offs between Type I and Type II errors when selecting a multiple comparison procedure.

Practical examples in A/B testing

Let's explore a scenario where you're running an experiment with multiple variants and metrics. Imagine you're testing three different button colors (blue, green, purple) to see which performs best. You're measuring click-through rate (CTR) as your primary metric and conversion rate as a secondary metric.

Without Bonferroni correction, you might see statistically significant results for one or more variants. However, these results could be false positives due to the multiple comparisons problem. By applying the Bonferroni correction, you adjust the p-values to account for the increased likelihood of false positives when testing multiple hypotheses simultaneously.

After applying the Bonferroni correction, you may find that some previously significant results are no longer significant. This is because the correction makes the significance threshold more stringent. Bonferroni multiple comparison helps you avoid making decisions based on false positives, ensuring that the significant results you act upon are more likely to be genuine.

When interpreting results with Bonferroni correction, focus on the corrected p-values. A corrected p-value is the raw p-value multiplied by the number of comparisons, capped at 1. If a variant's corrected p-value is below your predetermined significance level (e.g., 0.05), you can be more confident that the observed difference is real and not just due to chance. Use these corrected p-values to guide your decision-making process.
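Here is a sketch of how corrected p-values might be computed for the button color scenario. It assumes blue is the control, so green and purple are each compared against it on both metrics, giving four comparisons. The raw p-values are invented for illustration and do not come from a real experiment.

```python
def bonferroni_adjust(p_values: dict) -> dict:
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {key: min(1.0, p * m) for key, p in p_values.items()}


# Hypothetical raw p-values: (variant, metric) compared against the blue control.
raw_p_values = {
    ("green", "ctr"): 0.004,
    ("green", "conversion"): 0.03,
    ("purple", "ctr"): 0.02,
    ("purple", "conversion"): 0.4,
}

adjusted = bonferroni_adjust(raw_p_values)
ALPHA = 0.05

for key, p_adj in adjusted.items():
    verdict = "significant" if p_adj < ALPHA else "not significant"
    print(f"{key[0]:>6} / {key[1]:<10} raw={raw_p_values[key]:.3f} adjusted={p_adj:.3f} -> {verdict}")

still_significant = [key for key, p_adj in adjusted.items() if p_adj < ALPHA]
```

In this made-up data, three comparisons look significant before correction, but only green's click-through rate remains significant afterwards.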

Keep in mind that while the Bonferroni correction reduces false positives, it can also increase the risk of false negatives (type II errors). This means that some truly significant results might be missed due to the more conservative approach. It's essential to strike a balance between controlling for false positives and not being overly conservative in your testing.
