The **Bonferroni correction** is a statistical method used to adjust the significance level when conducting multiple hypothesis tests simultaneously. It aims to control the **family-wise error rate (FWER)**, which is the probability of making at least one Type I error (false positive) across all tests.

In simple terms, the Bonferroni correction helps maintain the overall significance level by dividing the desired significance level (α) by the number of tests performed. For example, if you have 10 hypothesis tests and want to maintain an overall significance level of 0.05, the Bonferroni correction would set the significance level for each individual test to 0.005 (0.05 / 10).

The importance of the Bonferroni correction lies in its ability to reduce the chances of obtaining false positives when conducting multiple comparisons. By controlling the FWER, researchers can have more confidence in the significant results they find, knowing that the probability of making at least one Type I error is limited to the desired level.

However, it's worth noting that the Bonferroni correction is a conservative approach and may lead to reduced statistical power, especially when the number of tests is large. This means that it may be more difficult to detect true significant differences, increasing the risk of Type II errors (false negatives).

Despite its limitations, the Bonferroni correction remains a widely used and straightforward method for addressing the multiple comparisons problem in various fields, including A/B testing, genomics, and social sciences.

**Multiple tests increase the risk of false positives.** When running multiple hypothesis tests, the probability of observing a significant result by chance alone increases. This is known as the multiple comparisons problem.

The **family-wise error rate (FWER)** is the probability of making at least one Type I error (false positive) among all hypotheses tested. For independent tests, it can be calculated as: FWER = 1 - (1 - α)^m, where α is the significance level for each individual test and m is the number of hypotheses.

For example, consider an A/B test with a control and two variants, each evaluated on five metrics. With a significance level of 0.05, the FWER is 1 - (1 - 0.05)^(2*5) = 0.401. **There's a 40.1% chance of observing at least one false positive.**
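The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the tests are independent; the function name `family_wise_error_rate` is chosen here for illustration.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Two variants compared against control, five metrics each: 10 tests in total.
variants, metrics = 2, 5
fwer = family_wise_error_rate(alpha=0.05, num_tests=variants * metrics)
print(f"FWER with {variants * metrics} tests: {fwer:.3f}")  # 0.401
```

Trying other values of `num_tests` shows how quickly the uncorrected FWER grows as metrics or variants are added.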

The Bonferroni correction is a simple method to control the FWER. It adjusts the significance level for each individual test by dividing it by the number of hypotheses: α_corrected = α / m.

Applying the Bonferroni correction to the previous example, the adjusted significance level would be 0.05 / (2*5) = 0.005. **Each individual test is evaluated at a more stringent threshold to maintain the desired FWER.**
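To see that the stricter threshold does bring the FWER back under the target, the sketch below plugs the corrected level into the same FWER formula. It again assumes independent tests, and the variable names are illustrative.

```python
alpha = 0.05
num_tests = 2 * 5  # two variants, five metrics each

# Bonferroni: each test is evaluated at alpha / m.
alpha_corrected = alpha / num_tests

# FWER when every one of the m independent tests uses the corrected level.
fwer_after_correction = 1.0 - (1.0 - alpha_corrected) ** num_tests

print(f"per-test threshold: {alpha_corrected:.4f}")
print(f"FWER after correction: {fwer_after_correction:.4f}")
```

The resulting FWER comes out slightly below 0.05, which reflects the conservative nature of the correction.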

However, the Bonferroni correction can be conservative, especially with a large number of hypotheses. It may lead to reduced statistical power and increased Type II errors (false negatives).

The preferential Bonferroni correction is a modification that gives additional weight to the primary metric. It allocates a larger portion of the overall significance level to the primary metric, while the remaining level is divided among secondary metrics.

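Mathematically, the preferential Bonferroni correction evaluates the primary metric at a significance level of γα/k and each secondary metric at (1 - γ)α/(k(m - 1)), where γ is the weighted allocation factor, k is the number of variants, and m is the total number of metrics. The corresponding confidence intervals have levels 1 - γα/k and 1 - (1 - γ)α/(k(m - 1)). Summed across all k variants and m metrics, these per-test levels add up to α.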
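A short sketch of this allocation follows. It assumes one primary metric, a per-test level of γα/k for the primary metric, and (1 - γ)α/(k(m - 1)) for each secondary metric. The function name and the choice of γ = 0.5 are illustrative, not taken from any particular platform.

```python
def preferential_bonferroni_alphas(alpha, gamma, num_variants, num_metrics):
    """Return (primary_alpha, secondary_alpha) per test.

    Assumes exactly one primary metric and num_metrics - 1 secondary metrics.
    """
    if num_metrics < 2:
        raise ValueError("need at least one secondary metric")
    primary = gamma * alpha / num_variants
    secondary = (1 - gamma) * alpha / (num_variants * (num_metrics - 1))
    return primary, secondary


primary, secondary = preferential_bonferroni_alphas(
    alpha=0.05, gamma=0.5, num_variants=2, num_metrics=5
)
print(f"primary: {primary:.6f}, secondary: {secondary:.6f}")

# The per-test levels add back up to the overall alpha.
total = 2 * (primary + (5 - 1) * secondary)
print(f"total: {total:.4f}")
```

Note that the primary threshold depends only on γ and the number of variants, so adding more secondary metrics does not tighten it.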

**This approach maintains higher power for detecting changes in the primary metric, even as the number of secondary metrics increases.** It strikes a balance between controlling the FWER and prioritizing the most important hypothesis.

The Bonferroni correction adjusts the significance level for each individual hypothesis test. It divides the desired overall significance level (α) by the number of hypotheses (m). This gives a new, more stringent significance level for each individual test: α/m.

Equivalently, you can apply the Bonferroni correction by multiplying each individual p-value by the number of hypotheses tested, capping the result at 1. If the resulting adjusted p-value is still below the desired overall significance level, the result is considered statistically significant. The correction also widens the confidence intervals for each individual hypothesis test.
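Both effects can be illustrated with the standard library. The sketch below computes adjusted p-values and compares the normal critical value behind a two-sided confidence interval before and after correction. The p-values are made up, and the interval calculation assumes a normal approximation.

```python
from statistics import NormalDist


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def z_critical(alpha, num_tests=1):
    """Two-sided normal critical value at level alpha / num_tests."""
    return NormalDist().inv_cdf(1 - alpha / (2 * num_tests))


raw = [0.001, 0.02, 0.2]  # hypothetical p-values
adjusted = bonferroni_adjust(raw)
print([round(p, 3) for p in adjusted])

# Interval half-width scales with the critical value.
print(f"z without correction: {z_critical(0.05):.3f}")
print(f"z with 10 tests:      {z_critical(0.05, num_tests=10):.3f}")
```

With 10 tests the critical value rises from roughly 1.96 to roughly 2.81, so each interval becomes about 43% wider for the same data.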

The Bonferroni correction is simple to apply and understand. It controls the family-wise error rate (FWER), ensuring a low probability of even one false positive. However, it can be overly conservative, especially with many hypotheses, leading to reduced statistical power and increased false negatives.

To implement the Bonferroni correction in your experiments, follow these steps:

1. Determine the desired **family-wise error rate (FWER)** for your experiment. This is the probability of making at least one Type I error (false positive) across all hypothesis tests.
2. Divide the desired FWER by the number of hypothesis tests in your experiment (one per metric and variant combination). This gives you the **adjusted significance level** for each individual test.
3. Conduct your hypothesis tests using the adjusted significance level. If any test results in a p-value lower than the adjusted level, consider it statistically significant.
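The steps above can be sketched as follows. The metric names and p-values are hypothetical placeholders for the output of whatever statistical test your experiment uses.

```python
# Step 1: choose the desired family-wise error rate.
fwer = 0.05

# Hypothetical p-values, one per hypothesis test.
p_values = {
    "conversion": 0.003,
    "revenue": 0.020,
    "retention": 0.049,
    "latency": 0.400,
}

# Step 2: divide the FWER by the number of tests.
adjusted_alpha = fwer / len(p_values)

# Step 3: compare each p-value to the adjusted significance level.
significant = {name: p < adjusted_alpha for name, p in p_values.items()}

print(f"adjusted significance level: {adjusted_alpha:.4f}")
for name, is_significant in significant.items():
    print(f"{name}: {'significant' if is_significant else 'not significant'}")
```

In this example `revenue` and `retention` would pass an uncorrected 0.05 threshold but not the adjusted one, which is the intended effect of the correction.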

When applying the Bonferroni correction, it's essential to consider the distinction between primary and secondary metrics. Primary metrics directly measure the experiment's main objective, while secondary metrics provide additional insights. To maintain statistical power for the primary metric, you can allocate a larger portion of the FWER to it, using the **preferential Bonferroni method**. This approach ensures that the power to detect changes in the primary metric doesn't depend on the number of secondary metrics.

Applying the Bonferroni correction impacts statistical power and sample size requirements. As the number of hypothesis tests increases, the adjusted significance level becomes more stringent, reducing the power to detect true differences. To maintain the desired power, you'll need to increase the sample size. Use power analysis tools to determine the required sample size based on the number of metrics, variants, and the expected effect size.
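To get a rough feel for the sample size impact, the sketch below uses the standard normal approximation for a two-sided, two-sample comparison of means. It is an approximation under stated assumptions (equal group sizes, known variance, a standardized effect size), not a replacement for a dedicated power analysis tool, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha, power=0.8, num_tests=1):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect_size is the standardized difference (mean difference / std dev).
    The Bonferroni-adjusted level alpha / num_tests is used for each test.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / num_tests) / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for m in (1, 5, 10, 20):
    n = sample_size_per_group(effect_size=0.2, alpha=0.05, num_tests=m)
    print(f"{m:>2} tests -> about {n} users per group")
```

The required sample size grows with the number of tests, but only through the critical value, so the growth is slower than linear in the number of tests.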

Remember, the Bonferroni correction is a conservative approach that controls the FWER. It may lead to reduced power and increased sample size requirements compared to other multiple testing correction methods. Consider the trade-offs between Type I and Type II errors when deciding on the appropriate correction method for your experiments.

While the Bonferroni correction is a simple and conservative approach to address multiple comparisons, other methods offer more power. The Holm-Bonferroni method is a step-down procedure that still controls the FWER but applies a less stringent correction to p-values, reducing the risk of false negatives. It sorts the p-values from smallest to largest and compares them to incrementally larger thresholds (α/m, then α/(m - 1), and so on), stopping at the first p-value that fails. It rejects every hypothesis the Bonferroni correction rejects and may identify additional significant results.
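A minimal sketch of the Holm-Bonferroni step-down procedure is shown below, next to plain Bonferroni for comparison. The p-values are hypothetical and the function names are chosen for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is below alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha=0.05):
    """Step-down procedure: thresholds alpha/m, alpha/(m-1), ..., alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first p-value that fails
    return reject


p_values = [0.001, 0.012, 0.014, 0.035, 0.3]  # hypothetical
print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Holm rejections:      ", sum(holm_bonferroni(p_values)))
```

On these values Bonferroni rejects only the smallest p-value, while Holm also rejects the second and third because their thresholds are 0.05/4 and 0.05/3 rather than 0.05/5.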

Another alternative is the False Discovery Rate (FDR) method, which controls the expected proportion of false positives among all significant results. The FDR method is less conservative than the Bonferroni correction and is particularly useful when dealing with a large number of tests. It allows for a higher rate of false positives but provides a better balance between Type I and Type II errors.
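One widely used FDR procedure is Benjamini-Hochberg. The text above does not name a specific procedure, so treat the choice here as an assumption. The sketch uses hypothetical p-values and an illustrative function name.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Reject the hypotheses with the k smallest p-values, where k is the
    largest rank whose p-value satisfies p_(k) <= k * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


p_values = [0.001, 0.012, 0.014, 0.035, 0.3]  # hypothetical
rejected = benjamini_hochberg(p_values, q=0.05)
print("Benjamini-Hochberg rejections:", sum(rejected))
```

On these values the procedure rejects four hypotheses, where a per-test Bonferroni threshold of 0.05/5 would reject only one. The trade-off is that it bounds the expected share of false discoveries rather than the chance of any false positive.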

Choosing the appropriate multiple comparison correction method depends on the specific goals and constraints of your analysis. If controlling the family-wise error rate is crucial and you prefer a conservative approach, the Bonferroni correction or Holm-Bonferroni method may be suitable. However, if you're willing to tolerate a higher false positive rate in exchange for increased power, the FDR method could be a better choice.

It's important to consider the trade-offs between power and false positive control when selecting a correction method. In scenarios where the cost of false positives is high, such as medical research or safety-critical applications, a more conservative approach like the Bonferroni correction may be preferred. On the other hand, in exploratory analyses or situations where false negatives are more problematic, methods like FDR can provide a better balance.

Ultimately, the choice of multiple comparison correction method should align with your research objectives and the specific characteristics of your data. By understanding the strengths and limitations of each approach, you can make an informed decision that maximizes the validity and interpretability of your results while controlling for the multiple comparisons problem.
