As companies become more comfortable with running A/B tests, they often consider testing more than one variation simultaneously. This is called a multi-arm experiment, also known as multi-group or A/B/n (ABn) testing (not to be confused with multi-armed bandit experiments, hopefully the topic of a later blog post).
Multi-arm testing involves two or more test groups (e.g. A and B) and a control group (e.g. C). This allows comparing A against C at the same time as B against C, thus reusing your control group. It also affords a head-to-head comparison between the test groups (A vs B) to evaluate differences or identify a clear statistical winner.
This is a really powerful tool in an experimentalist’s arsenal that can reduce sample size, costs, and time, while testing multiple hypotheses in parallel. However, there is one major pitfall to watch out for.
When I worked in drug research, our biologists would present in vitro (test tube) results during our weekly meetings. Their slides would show the results from the 20 or so latest experimental drug molecules, all compared to a control molecule (typically a competitor’s drug), complete with p-values and confidence intervals. Every week we got excited over any molecule that showed statistically significant results, but we were frequently disappointed when we couldn’t reproduce those results the following week. This is called a Type I error (also known as a false positive), and it occurs when a significant difference arises from statistical randomness rather than any actual meaningful difference.
We shouldn’t have been surprised. We set our significance level (⍺) to 0.05, which means each comparison had a 5% chance of showing a significant result through random chance alone when no difference actually exists. With 20 compounds tested per week, the chance of at least one false positive was roughly 64% (1 − 0.95²⁰, treating the comparisons as independent), so we should have expected this error most weeks. There are two common solutions to this problem, but I recommend a third, more practical solution.
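To put a number on that weekly disappointment, here is a minimal Python sketch of the chance of seeing at least one false positive across several comparisons. It assumes the comparisons are independent, which is only an approximation when every molecule is measured against the same control, and the function name is mine, chosen for illustration.

```python
def family_wise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons,
    assuming no true difference exists in any of them."""
    # Each comparison avoids a false positive with probability (1 - alpha),
    # so all of them avoid one with probability (1 - alpha) ** num_comparisons.
    return 1.0 - (1.0 - alpha) ** num_comparisons


# One comparison, a two-arm test, and the 20-molecule weekly readout.
for n in (1, 2, 20):
    print(f"{n:>2} comparisons: {family_wise_error_rate(0.05, n):.1%}")
```

With ⍺ = 0.05 and 20 comparisons this comes out at roughly 64%, so a "significant" molecule showing up most weeks is what chance alone would predict.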
Most statisticians (and textbooks) will suggest you apply a Bonferroni correction to the significance level. You do this by dividing the significance level (⍺) by the number of comparisons (or number of test groups). If you are running two test groups with ⍺ = 0.05, you should divide your significance level by two, to 2.5%. If you are running 20 trials, you should divide it by 20, from 5% to 0.25%. This lowers the chance of a false positive on any individual comparison so that the overall Type I error rate across the experiment stays at roughly 5%. In my drug research example above, this means we would have made a Type I error about once every 20 weeks, instead of most weeks. This doesn’t come for free: applying the Bonferroni correction raises the chance of making a Type II error (false negative), where a material difference goes undetected.
The Bonferroni correction is a risk-averse solution that errs on the side of avoiding Type I errors at the expense of Type II errors.
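The correction itself is a one-line division. The sketch below applies it to a set of p-values; the function names, arm labels, and p-values are all hypothetical and only there to show the mechanics.

```python
from typing import Dict


def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance level after a Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def significant_after_bonferroni(
    p_values: Dict[str, float], alpha: float = 0.05
) -> Dict[str, bool]:
    """Flag each test group whose p-value clears the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {arm: p < threshold for arm, p in p_values.items()}


# Made-up p-values for three test groups, each compared against control.
example_p_values = {"variant_1": 0.040, "variant_2": 0.012, "variant_3": 0.001}
print(bonferroni_threshold(0.05, 20))  # per-comparison level for 20 trials
print(significant_after_bonferroni(example_p_values))
```

In this made-up example, variant_1 would pass an uncorrected 0.05 threshold but fails the corrected one (0.05 / 3 ≈ 0.0167). If its effect were real, that would be a Type II error, which is the cost of the correction.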
The second common solution is to repeat the experiment. This is what we did in my drug discovery anecdote: we reran the experiment to confirm the results. The chance of making another Type I error is 5% (⍺). The chance of making 2 Type I errors in a row is 0.25% (⍺²). Repeating the experiment is the best approach when experiments are cheap and quick. But product development experiments take weeks, if not months, and reproducing results may not be possible if you don’t have a fresh batch of unexposed users.
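The ⍺² arithmetic can be checked with a short simulation. This is a sketch under two assumptions: the repeat runs are independent, and there is no true effect, in which case p-values from a continuous test are uniformly distributed between 0 and 1. The function names are illustrative.

```python
import random


def repeated_false_positive_rate(alpha: float, runs: int) -> float:
    """Chance that `runs` independent experiments all show a false positive."""
    return alpha ** runs


def simulate_repeated_false_positives(
    alpha: float, runs: int, trials: int, seed: int = 0
) -> float:
    """Monte Carlo estimate of the same quantity, drawing uniform p-values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A trial counts only if every independent run comes out "significant".
        if all(rng.random() < alpha for _ in range(runs)):
            hits += 1
    return hits / trials


print(repeated_false_positive_rate(0.05, 2))
print(simulate_repeated_false_positives(0.05, runs=2, trials=200_000))
```

Both numbers should land close to 0.0025, which is why a single confirmation run is such a strong filter when you can afford one.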
In product development, it’s competitively important to gather directional data to make quick decisions and improve the product for your users. All hypothesis testing involves trading off accuracy for speed, and it’s important to be thoughtful about how we set the statistical test parameters (i.e. ⍺). It’s also important to understand that lowering Type I errors comes at the cost of raising Type II errors.
The Bonferroni correction (solution #1) purposely trades off Type I errors against Type II errors. When you’re making drugs that could kill people, taking a cautious stance is absolutely warranted. But deciding whether to go with the red or blue button, or the new shopping layout versus the old, presents a very different risk. Repeating experiments (solution #2) typically takes too much time. And if your first experiment was ramped up to a large percentage of your user base, it’ll be a challenge to find untainted users.
My recommendation is to consider whether you are comfortable with increasing your Type I error rates in favor of not missing an actual difference (Type II errors). Such an approach can be applied to any hypothesis test, but in my experience, is especially relevant to multi-arm experiments. I’ve typically seen multi-arm experiments deployed when there’s evidence or a strong belief that the control/default experience needs to be replaced. Perhaps you want to test a 2.0 of your UI redesign, but have 3 different versions. Or you want to add a new feature to your ML algorithm, but want to pick the ideal tuning parameters. In these scenarios, it may not make sense to give the control group an unfair advantage by lowering your significance level (⍺).
Increasing Type I error rates is most suitable when:
You have prior knowledge or data that the control group is suboptimal.
The real objective of the experiment is to determine the best test group.
Your team/company is already committed to making a change.
At the same time, you don’t want to be at the mercy of statistical noise. This can thrash your user experience, trigger unknown secondary effects, and/or create extra product work. As a rule of thumb, if you use ⍺=0.05, you should feel comfortable running up to 4 variations. This slightly biases you towards making a change, but keeps the overall Type I error rate at a reasonable level (under 20%). If you want to try more variations, I do suggest raising your bar, or you’ll eventually end up with Type I errors occurring more frequently than not. Ten comparisons at ⍺=0.05 produce at least one Type I error about 40% of the time (you can approximate that as 10 × 0.05 = 50% for simplicity).
Rule of thumb: Up to 4 variations can be run at a significance level of ⍺=0.05; any more and you should probably lower your significance threshold.
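The numbers behind this rule of thumb can be reproduced with a few lines of Python. As before, this is a sketch that assumes independent comparisons; the function names and the 20% cap passed in are illustrative choices, not a fixed standard.

```python
def overall_type_i_rate(alpha: float, num_variations: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1.0 - (1.0 - alpha) ** num_variations


def max_variations(alpha: float, cap: float) -> int:
    """Largest number of variations whose overall Type I rate stays under `cap`."""
    if not 0.0 < alpha < 1.0 or not 0.0 < cap < 1.0:
        raise ValueError("alpha and cap must both be between 0 and 1")
    count = 0
    while overall_type_i_rate(alpha, count + 1) < cap:
        count += 1
    return count


for k in (1, 2, 4, 5, 10):
    exact = overall_type_i_rate(0.05, k)
    shortcut = min(1.0, k * 0.05)  # the "k x alpha" shortcut, an upper bound
    print(f"{k:>2} variations: exact {exact:.1%}, shortcut {shortcut:.0%}")

print(max_variations(0.05, 0.20))  # 4
```

The shortcut of multiplying the number of variations by ⍺ always overstates the exact rate slightly, so it is a safe, conservative mental check.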
I recommend using ⍺=0.05 by default, but there are situations where it’s worth changing it. Multi-arm experiments can be such a situation, and it’s important to acknowledge and understand the tradeoff between Type I and Type II errors. If you want to be cautious and maintain your overall Type I error rate at 5%, use a Bonferroni correction, but realize you’re increasing your Type II error rate. I suggest maintaining ⍺=0.05 for individual comparisons when running 4 or fewer test groups in a multi-arm experiment, particularly when you don’t want to bias the results too strongly toward the control group.
Interested in running multi-arm experiments? Statsig can help. Check us out at https://statsig.com. May all your tests be appropriately significant.
Thanks to Jonathan Chng on Unsplash for the runners photo!