Bonferroni Test

The Bonferroni test, also known as the Bonferroni correction, is a statistical method used to counteract the problem of multiple comparisons. When conducting multiple hypothesis tests simultaneously, the likelihood of obtaining a significant result by chance alone increases. The Bonferroni test adjusts the significance level for each individual test to maintain the desired overall significance level, reducing the risk of false positives.

In scenarios involving multiple hypothesis testing, such as analyzing numerous metrics or variants in an experiment, the Bonferroni correction becomes crucial. Without proper adjustment, the probability of making at least one Type I error (rejecting a true null hypothesis) increases rapidly with the number of tests performed. By applying the Bonferroni test, researchers can control the family-wise error rate (FWER), ensuring that the probability of making at least one Type I error across all tests stays at or below the desired level.
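To make that growth concrete, here is a minimal sketch that computes the family-wise error rate with and without the correction. It assumes the tests are independent, which is what makes the closed form 1 - (1 - α)^m valid; the function names are illustrative and not part of any library.

```python
def fwer_uncorrected(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


def fwer_bonferroni(alpha: float, m: int) -> float:
    """Same quantity when each test uses the Bonferroni threshold alpha / m."""
    return 1.0 - (1.0 - alpha / m) ** m


for m in (1, 5, 10, 20, 50):
    print(
        f"m={m:>2}  uncorrected={fwer_uncorrected(0.05, m):.3f}  "
        f"bonferroni={fwer_bonferroni(0.05, m):.3f}"
    )
```

With 10 tests at α = 0.05 the uncorrected rate is already about 0.40, while the corrected rate stays just under 0.05 for every m.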

The key components of the Bonferroni test include the number of hypothesis tests (m) and the desired overall significance level (α). Mathematically, the Bonferroni correction adjusts the significance level for each individual test by dividing α by m. For example, if conducting 10 tests with a desired overall significance level of 0.05, each individual test would have a significance level of 0.005 (0.05 / 10). This adjustment makes the criteria for rejecting the null hypothesis more stringent, reducing the likelihood of false positives.

How does the Bonferroni test work?

The Bonferroni test is a simple yet effective method for correcting multiple comparisons. It works by dividing the desired significance level (α) by the number of hypotheses being tested (m). This adjusted significance level (α/m) is then used as the new threshold for determining statistical significance.

For example, if you're testing 20 hypotheses with a desired α of 0.05, the Bonferroni-corrected significance level would be 0.05/20 = 0.0025. Any p-value less than 0.0025 would be considered significant after the correction.
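The rule above can be sketched in a few lines of Python. The p-values are invented for illustration only, and `bonferroni_reject` is a hypothetical helper name rather than a library function.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return True for each hypothesis whose p-value is below alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p < threshold for p in p_values]


# Hypothetical p-values for 20 tests (not real data).
p_values = [0.0004, 0.0030, 0.0100, 0.0400] + [0.5] * 16

decisions = bonferroni_reject(p_values, alpha=0.05)
print("per-test threshold:", 0.05 / len(p_values))
print("first four decisions:", decisions[:4])
print("hypotheses rejected:", sum(decisions))
```

Only the first p-value (0.0004) clears the corrected threshold. Without the correction, all four of the leading p-values would have counted as significant at 0.05.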

The Bonferroni test is designed to control the family-wise error rate (FWER), which is the probability of making at least one Type I error (false positive) among all hypotheses tested. By setting a more stringent significance level, the Bonferroni correction reduces the likelihood of obtaining false positives when conducting multiple tests.

However, the Bonferroni test can be quite conservative, especially when dealing with a large number of hypotheses. As the number of tests increases, the adjusted significance level becomes smaller, making it more difficult to detect true positives (i.e., increased risk of Type II errors or false negatives).
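A rough way to see this loss of power is to compute the power of a single two-sided z-test as its threshold shrinks from α to α/m. This is a sketch under simplifying assumptions: the test statistic is taken to be normal with unit variance, and the standardized effect of 2.8 is a hypothetical value chosen to give roughly 80% power at α = 0.05.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def z_test_power(effect: float, alpha: float) -> float:
    """Power of a two-sided z-test whose statistic has mean `effect` and unit variance."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - _STD_NORMAL.cdf(z_crit - effect)
    lower_tail = _STD_NORMAL.cdf(-z_crit - effect)
    return upper_tail + lower_tail


effect = 2.8  # hypothetical standardized effect (assumption, not from the article)
for m in (1, 5, 20, 100):
    per_test_alpha = 0.05 / m
    print(
        f"m={m:>3}  threshold={per_test_alpha:.5f}  "
        f"power={z_test_power(effect, per_test_alpha):.2f}"
    )
```

The same true effect becomes steadily harder to detect as m grows, which is the Type II cost of the correction.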

Despite its limitations, the Bonferroni test remains a widely used method for multiple testing correction due to its simplicity and effectiveness in controlling the FWER. It is particularly useful when the number of hypotheses is relatively small, and the cost of false positives is high.

When applying the Bonferroni test, it's essential to consider the trade-off between Type I and Type II errors. While the correction helps minimize false positives, it may also increase the risk of missing true effects, particularly when dealing with a large number of tests or when the effect sizes are small.

When to use the Bonferroni test

The Bonferroni correction is useful when conducting multiple hypothesis tests simultaneously. It helps control the family-wise error rate (FWER), reducing the likelihood of Type I errors (false positives). Apply the Bonferroni test when you have a small number of comparisons and want to maintain a strict control over false positives.

One advantage of using the Bonferroni correction is its simplicity. By adjusting the significance level for each individual test, it guarantees that the overall FWER does not exceed the desired level (e.g., 0.05), whatever the relationship between the tests. This conservative approach is particularly valuable when false positives could lead to costly or harmful consequences.

However, the Bonferroni test has some limitations and drawbacks. As the number of comparisons increases, the correction becomes more conservative, potentially leading to a loss of statistical power and increased risk of Type II errors (false negatives). In situations with a large number of tests, the Bonferroni correction may be too stringent, making it difficult to detect true differences between groups.

Another consideration is dependence among the tests. The Bonferroni correction does not assume independence: it rests on the union bound, so it controls the FWER under any dependence structure. The price is that when the tests are strongly positively correlated, as is common for related metrics, the correction is more conservative than necessary. The Holm-Bonferroni procedure is valid under the same conditions and is never less powerful, while the Hochberg procedure offers further power but requires independent or positively dependent tests.

When deciding whether to use the Bonferroni test, consider the number of comparisons, the desired level of Type I error control, and the potential consequences of false positives. If you have a small number of planned comparisons and strict control over false positives is crucial, the Bonferroni correction can be a suitable choice. However, if you have a large number of tests or are concerned about loss of power, explore alternative multiple testing correction methods.

Interpreting Bonferroni test results

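The Bonferroni test accounts for multiple comparisons in one of two equivalent ways: compare each raw p-value with the stricter threshold α/m, or multiply each p-value by m (capping the result at 1) and compare the adjusted p-value with α. Confidence intervals are adjusted in the same spirit by widening each one to the 1 - α/m confidence level. Either way, the bar for significance is higher, which reduces the risk of false positives.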

When interpreting Bonferroni-corrected results, focus on the adjusted p-values and confidence intervals. These values provide a more conservative estimate of statistical significance, considering the number of hypotheses tested.

Compare the adjusted and unadjusted results to understand the impact of the correction. If a result remains significant after the Bonferroni adjustment, you can be more confident in its validity.
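As a small illustration of that comparison, the sketch below computes Bonferroni-adjusted p-values as min(1, m·p) and the per-interval confidence level 1 - α/m. The p-values are hypothetical and the helper names are not from any particular library.

```python
from typing import List, Sequence


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bonferroni_confidence_level(alpha: float, m: int) -> float:
    """Per-interval confidence level giving simultaneous coverage of at least 1 - alpha."""
    return 1.0 - alpha / m


# Hypothetical p-values for four metrics (not real data).
raw = [0.003, 0.012, 0.020, 0.300]
adjusted = bonferroni_adjust(raw)

for p, p_adj in zip(raw, adjusted):
    print(
        f"raw={p:.3f}  adjusted={p_adj:.3f}  "
        f"significant before={p < 0.05}  after={p_adj < 0.05}"
    )

print("per-interval confidence level:", bonferroni_confidence_level(0.05, len(raw)))
```

In this example three metrics are significant before the adjustment and two remain significant after it; the third (raw p = 0.020) is the kind of result that may still deserve a follow-up test.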

However, the Bonferroni test can be overly conservative, potentially leading to false negatives. If a result is not significant after the correction, it may still be worth investigating further.

When making decisions based on Bonferroni-corrected outcomes, consider the context and practical significance of the results. A statistically significant result may not always translate to a meaningful difference in practice.

Balancing the need to control for multiple comparisons with the desire to detect true effects is crucial. The Bonferroni test provides a rigorous approach, but it's not the only option.

Other methods, such as the Benjamini-Hochberg procedure, control the false discovery rate instead of the FWER. Because that is a weaker guarantee, they retain more power, which is particularly useful when dealing with a large number of hypotheses.

Ultimately, interpreting the results of a Bonferroni test requires careful consideration of the research question, the number of comparisons made, and the practical implications of the findings. By understanding the strengths and limitations of this correction method, you can make informed decisions based on your experimental results.

Alternatives and variations

While the Bonferroni correction is a simple and effective method for controlling the family-wise error rate in multiple hypothesis testing, there are several alternatives and variations worth considering:

The Holm-Bonferroni correction is a step-down procedure that is never less powerful than the standard Bonferroni correction and controls the FWER under the same conditions. It sorts the p-values from smallest to largest and compares the k-th smallest with α/(m - k + 1), stopping at the first hypothesis that fails to be rejected; that hypothesis and all those with larger p-values are retained.
The Šidák correction is similar to the Bonferroni correction but assumes that the individual tests are independent. It calculates the adjusted significance level as 1 - (1 - α)^(1/m), where α is the desired family-wise error rate and m is the number of hypotheses.
The Benjamini-Hochberg procedure controls the false discovery rate (FDR) instead of the family-wise error rate. FDR is the expected proportion of false positives among all significant results. This method is less conservative than the Bonferroni correction and offers more power when testing a large number of hypotheses.
The Benjamini-Yekutieli procedure is a variation of Benjamini-Hochberg that controls the FDR under arbitrary dependence among the hypotheses. The guarantee comes at a cost: it is more conservative than Benjamini-Hochberg, which is itself valid when the tests are independent or positively dependent.
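The procedures listed above can be compared side by side with a short sketch. These are minimal textbook-style implementations written for illustration, not production code; the function names and the p-values are hypothetical, and rejection uses "less than or equal to" at each threshold.

```python
from typing import List, Sequence


def sidak_threshold(alpha: float, m: int) -> float:
    """Per-test threshold under the Sidak correction (assumes independent tests)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down: compare the k-th smallest p-value with alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that is not rejected
    return reject


def benjamini_hochberg_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up: find the largest k with p_(k) <= alpha * k / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= alpha * k / m:
            cutoff = k
    reject = [False] * m
    for i in order[:cutoff]:  # reject the `cutoff` smallest p-values
        reject[i] = True
    return reject


# Hypothetical p-values for six tests (not real data).
p_values = [0.001, 0.009, 0.020, 0.030, 0.040, 0.300]
m = len(p_values)

bonferroni = [p <= 0.05 / m for p in p_values]
print("Bonferroni threshold:", 0.05 / m)
print("Sidak threshold:     ", sidak_threshold(0.05, m))
print("Bonferroni rejections:", sum(bonferroni))
print("Holm rejections:      ", sum(holm_reject(p_values)))
print("BH rejections:        ", sum(benjamini_hochberg_reject(p_values)))
```

On these six p-values Bonferroni rejects one hypothesis, Holm rejects two, and Benjamini-Hochberg rejects five. Keep in mind that Benjamini-Hochberg is controlling a different quantity (FDR rather than FWER), so its extra rejections come with a weaker guarantee.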

When applying the Bonferroni test or its alternatives, it's crucial to strike a balance between controlling Type I errors (false positives) and minimizing Type II errors (false negatives). Being too conservative with the significance level may lead to missed discoveries, while being too lenient can result in false positives.

To find the right balance, consider the following factors:

The cost of false positives versus false negatives in your specific context. In some cases, false positives may be more detrimental than false negatives, or vice versa.
The number of hypotheses being tested. As the number of hypotheses increases, the Bonferroni correction becomes more conservative, potentially leading to a higher rate of false negatives.
The expected effect sizes. Large effects remain detectable even at a strict per-test threshold, so a conservative correction costs little power. Small effects are the ones most likely to be missed once the threshold is tightened.

By carefully considering these factors and selecting an appropriate multiple testing correction method, you can effectively control the error rates in your experiments while maximizing the power to detect true effects.
