Pitfalls of Multi-arm Experiments

Timothy Chan
Tue May 04 2021
EXPERIMENTATION HYPOTHESIS-TESTING AB-TESTING DATA-SCIENCE LEAN-STARTUP

Dealing with Significance (⍺) for Multiple Test Groups

As companies become more comfortable with running AB tests, they often consider testing more than one variation simultaneously. This is called a multi-arm experiment, also known as multi-group or A/B/n (ABn) testing (not to be confused with multi-armed bandit experiments, hopefully the topic of a later blog post). Multi-arm testing involves two or more test groups (e.g. A and B) and a control group (e.g. C). This allows comparing A against C at the same time as B against C, reusing your control group. It also affords a head-to-head comparison between the test groups (A vs B) to evaluate differences or identify a clear statistical winner. This is a really powerful tool in an experimentalist’s arsenal that can reduce sample size, costs, and time while testing multiple hypotheses in parallel. However, there is one major pitfall to watch out for.

Multiple test groups, which one will win? — Photo by Jonathan Chng on Unsplash

Significance is Much More Significant

When I worked in drug research, our biologists would present in vitro (test tube) results during our weekly meetings. Their slides would show the results from the 20 or so latest experimental drug molecules, all compared to a control molecule (typically a competitor’s drug), complete with p-values and confidence intervals. Every week we got excited over any molecule that showed statistically significant results, but we were frequently disappointed when we couldn’t reproduce those results the following week. This is called a Type I error (also known as a false positive) and occurs when a significant difference arises from statistical randomness rather than any actual meaningful difference.

We shouldn’t have been surprised; we set our significance level (⍺) to 0.05, which means each comparison had a 5% chance of showing significant results due to random chance when no difference actually exists. Testing 20 compounds per week all but guaranteed we would make this error weekly: across 20 comparisons, the chance of at least one false positive is roughly 64%. There are two common solutions to this problem, but I recommend a third, more practical solution.
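For intuition, here is a quick back-of-the-envelope sketch of that 64% figure (it assumes the 20 comparisons are independent, which real experiments only approximate):

    # Family-wise error: chance of at least one false positive across
    # k independent comparisons, each tested at significance level alpha.
    alpha = 0.05
    k = 20
    fwer = 1 - (1 - alpha) ** k
    print(f"Chance of at least one false positive across {k} comparisons: {fwer:.0%}")  # ~64%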

Solution #1: Apply a Bonferroni correction

Most statisticians (and textbooks) will suggest you apply a Bonferroni correction to the significance level. You do this by dividing the significance level (⍺) by the number of comparisons (or number of test groups). If you are running two test groups with ⍺ = 0.05, you should cut your significance level in half, to 2.5%. If you are running 20 trials, you should cut it by a factor of 20, from 5% to 0.25%. This lowers the chance of a false positive on any individual comparison and keeps the overall Type I error rate across the experiment at or below 5%. In my drug research example above, this means we would have made a Type I error roughly once every 20 weeks instead of almost every week. This doesn’t come for free: applying the Bonferroni correction raises the chance of making a Type II error (false negative), where a material difference goes undetected.
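The correction itself is a one-liner. A minimal sketch, assuming each test group is compared only against the control:

    # Bonferroni correction: divide the overall significance level by the
    # number of comparisons to get the per-comparison threshold.
    alpha = 0.05
    num_comparisons = 2   # e.g. two test groups, each compared to the control
    alpha_per_comparison = alpha / num_comparisons
    print(f"Test each comparison at alpha = {alpha_per_comparison}")  # 0.025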

The Bonferroni correction is a risk-averse solution that errs on the side of avoiding Type I errors at the expense of Type II errors.

Solution #2: Repeat your experiment

This is what we did in my drug discovery anecdote: we reran the experiment to confirm the results. The chance of making another Type I error is 5% (⍺). The chance of making 2 Type I errors in a row is 0.25% (⍺²). Repeating the experiment is the best approach when experiments are cheap and quick. But product development experiments take weeks, if not months, and reproducing results may not be possible if you don’t have a fresh batch of unexposed users.
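The arithmetic behind that, again assuming the repeat is independent of the first run:

    # Chance that the same false positive reproduces across independent repeats.
    alpha = 0.05
    print(f"One experiment:       {alpha:.2%}")       # 5.00%
    print(f"Two repeats in a row: {alpha ** 2:.2%}")  # 0.25%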

Solution #3: Lower your Type II error rates by accepting a higher Type I error rate (my recommendation)

In product development, it’s competitively important to gather directional data to make quick decisions and improve the product for your users. All hypothesis testing involves trading off accuracy for speed, and it’s important to be thoughtful about how we set the statistical test parameters (i.e. ⍺). It’s also important to understand that lowering Type I errors comes at the cost of raising Type II errors.

The Bonferroni correction (solution #1) purposely accepts more Type II errors in order to avoid Type I errors. When you’re making drugs that could kill people, taking a cautious stance is absolutely warranted. But deciding whether to go with the red or blue button, or the new shopping layout versus the old, presents a very different risk. Repeating experiments (solution #2) typically takes too much time, and if your first experiment was ramped up to a large percentage of your user base, it’ll be a challenge to find untainted users.

My recommendation is to consider whether you are comfortable with increasing your Type I error rate in exchange for not missing an actual difference (a Type II error). Such an approach can be applied to any hypothesis test, but in my experience it is especially relevant to multi-arm experiments. I’ve typically seen multi-arm experiments deployed when there’s evidence or a strong belief that the control/default experience needs to be replaced. Perhaps you want to test version 2.0 of your UI redesign but have 3 different candidates. Or you want to add a new feature to your ML algorithm but want to pick the ideal tuning parameters. In these scenarios, it may not make sense to give the control group an unfair advantage by lowering your significance level (⍺).

Increasing Type I error rates is most suitable when:

  1. You have prior knowledge or data that the control group is suboptimal.
  2. The real objective of the experiment is to determine the best test group.
  3. Your team/company is already committed to making a change.

At the same time, you don’t want to be at the mercy of statistical noise. This can thrash your user experience, trigger unknown secondary effects, and/or create extra product work. As a rule of thumb, if you use ⍺=0.05, you should feel comfortable running up to 4 variations. This slightly biases you towards making a change but keeps the overall Type I error rate at a reasonable level (under 20%). If you want to try more variations, I do suggest raising your bar, or Type I errors will start showing up in most of your experiments. 10 comparisons at ⍺=0.05 give a roughly 40% chance of at least one Type I error (you can approximate that as 10 x 0.05 = 50% for simplicity).

Rule of thumb: Up to 4 variations can be run at a significance level of ⍺=0.05; any more and you should probably lower your significance threshold.
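To see where that threshold comes from, here is a small sketch (assuming independent comparisons of each variation against the control):

    # Overall (family-wise) Type I error rate for n comparisons at alpha = 0.05.
    alpha = 0.05
    for n in (1, 2, 4, 10):
        exact = 1 - (1 - alpha) ** n       # chance of at least one false positive
        upper_bound = n * alpha            # the simpler "n x alpha" approximation
        print(f"{n:>2} comparisons: ~{exact:.0%} (upper bound {upper_bound:.0%})")

Four variations keep the overall rate just under 20%, which is where the rule of thumb comes from; by 10 comparisons you are around 40%.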

Conclusion

I recommend using ⍺=0.05 by default, but there are situations where it’s worth changing it. Multi-arm experiments can be such a situation, and it’s important to acknowledge and understand the tradeoff between Type I and Type II errors. If you want to be cautious and keep your overall Type I error rate at 5%, use a Bonferroni correction, but realize you’re increasing your Type II error rate. I suggest maintaining ⍺=0.05 for individual comparisons when running 4 or fewer test groups in a multi-arm experiment, particularly when you don’t want to bias the results too strongly toward the control group.

Interested in running multi-arm experiments? Statsig can help. Check us out at https://statsig.com. May all your tests be appropriately significant.

