What is A/B testing and why is it important?

Thu Sep 05 2024

Yuzheng Sun, PhD

Data Scientist, Statsig

AB testing is the most reliable way to get evidence that a change caused an outcome.

Underlying AB testing is the concept of “randomized controlled trials (RCTs).” RCTs are the gold standard for establishing causality.

Below is the famous hierarchy of evidence pyramid. Essentially, the only form of evidence that is stronger than an RCT is a meta-analysis of RCTs. Presenting an RCT in an argument usually settles the argument.

[Image: hierarchy of evidence pyramid, with systematic reviews and meta-analyses at the top]

There are two technical insights that enable the power of RCTs:

  1. With a large enough sample, randomization cancels out biases. This follows from the law of large numbers: with a large sample, we don’t need to worry about differences in observable or unobservable variables between the groups, because randomization balances them on average.

  2. With randomized assignments, the difference between the treatment group and the control group is caused by the treatment.
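The first insight can be checked with a small simulation. The sketch below is illustrative only: the covariate (household income), its distribution, and the function name `covariate_gap` are assumptions made up for this example. It randomly splits users into two groups and reports how far apart the group averages of a variable nobody controlled for end up.

```python
import random
import statistics

def covariate_gap(n_users, seed=0):
    """Randomly split n_users into two halves and return the absolute
    difference between the halves' mean income."""
    rng = random.Random(seed)
    # Hypothetical covariate: household income, roughly N(50_000, 15_000).
    incomes = [rng.gauss(50_000, 15_000) for _ in range(n_users)]
    # Random assignment: shuffle, then cut the list in half.
    rng.shuffle(incomes)
    half = n_users // 2
    treatment, control = incomes[:half], incomes[half:]
    return abs(statistics.mean(treatment) - statistics.mean(control))

# The gap between the groups tends to shrink as the sample grows.
for n in (20, 2_000, 200_000):
    print(f"n={n:>7}: income gap between groups = {covariate_gap(n):,.0f}")
```

With a handful of users the two groups can differ by thousands in average income purely by chance. With hundreds of thousands, the gap is typically small relative to the spread of incomes, even though the assignment never looked at income.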

“Caused by the treatment” is a very strong statement. In most comparisons that are not RCTs, the difference between two groups is usually a result of selection bias rather than the treatment.

Let’s use one quick example, which also illustrates what “random assignment” is and its importance.

Understanding treatment effect with an example

Suppose I claim that I have a magic pill that costs $100 and can increase the height of high school students by 1 inch over a year. I will show you two true results from my study:

  1. Test group: 1000 students who voluntarily took the pill a year ago. Their average height was 60 inches a year ago and 62 inches this year.

  2. Control group: 1000 students from the same schools and of the same age. Their average height was 60 inches a year ago and 61 inches this year.

Can we conclude that this pill is effective? We all know that such a magic pill doesn’t exist, but what’s the loophole in this study?

The loophole in this study is “selection bias.” People are (self) selected into the treatment group. Those who volunteer for the study may come from wealthier families, as they can afford the pill, or they may be more eager to grow taller and may have tried other things besides taking the pill. Any such factor undermines a causal conclusion from this study.

But if we take 2000 students and assign the pill randomly, we remove the selection bias. By the law of large numbers, the average characteristics of the two groups (height, wealth, natural growth, eagerness to grow, etc.) should be about the same, so the difference in their height growth can be attributed to the treatment: the pill.
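The pill example can be sketched as a toy simulation. Everything here is an assumption chosen for illustration: the hidden `wealth` factor, the way it drives natural growth, and the rule that wealthier students are more likely to volunteer. The pill's true effect is set to zero, and the numbers are picked so that self-selection alone produces roughly the one-inch gap from the study above.

```python
import random
import statistics

TRUE_PILL_EFFECT = 0.0  # inches; assumed: the pill does nothing

def simulate_pill_study(n_students=2000, seed=1):
    """Return (gap under self-selection, gap under random assignment),
    where gap = mean growth with the pill minus mean growth without it."""
    rng = random.Random(seed)
    students = []
    for _ in range(n_students):
        wealth = rng.random()  # hypothetical hidden factor in [0, 1)
        # Assumed: wealthier students grow more on their own (e.g. nutrition).
        natural_growth = 3.0 * wealth + rng.gauss(0, 0.2)
        students.append((wealth, natural_growth))

    def growth_gap(took_pill):
        pill = [g + TRUE_PILL_EFFECT for (_, g), t in zip(students, took_pill) if t]
        no_pill = [g for (_, g), t in zip(students, took_pill) if not t]
        return statistics.mean(pill) - statistics.mean(no_pill)

    # Self-selection: the chance of volunteering rises with wealth.
    volunteered = [rng.random() < wealth for wealth, _ in students]
    # Random assignment: a coin flip that ignores wealth.
    randomized = [rng.random() < 0.5 for _ in students]
    return growth_gap(volunteered), growth_gap(randomized)

self_selected_gap, randomized_gap = simulate_pill_study()
print(f"self-selected gap: {self_selected_gap:.2f} inches")
print(f"randomized gap:    {randomized_gap:.2f} inches")
```

Under self-selection the useless pill appears to add about an inch, because the volunteers would have grown more anyway. Under random assignment the gap stays close to zero, which is the pill's true effect in this sketch.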

Selection biases in product development

Taking this example to product development, we can see how easy it is to make such mistakes every day without an AB testing mindset. For example:

Selection bias in time series: 

  • Claim: We shipped a feature and metrics increased 10% 

  • Reality: The metrics would have increased 10% even without the feature, for example when a Black Friday banner ships just before Black Friday.

Selection bias in cross sections:

  • Claim: We shipped a feature, and users who use the feature saw a 10% increase in their metrics.

  • Reality: The users who self-select into using the feature would have seen a 10% increase even without the feature, for example when a button is given to power users (ref: why most aha moments are wrong?)
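The time-series case can be sketched the same way. In the toy model below, all numbers and names (`daily_revenue`, `compare`, a 10% seasonal lift starting on day 7) are made up for illustration, and the feature itself is assumed to have no effect. A before/after comparison credits the feature with the seasonal lift, while a concurrent randomized split does not.

```python
import random
import statistics

def daily_revenue(day, has_feature, rng):
    """Toy per-user revenue; every number here is an assumption."""
    baseline = 100.0
    seasonal_lift = 1.10 if day >= 7 else 1.00  # e.g. Black Friday week starts on day 7
    feature_lift = 1.00 if has_feature else 1.00  # the feature truly does nothing
    return baseline * seasonal_lift * feature_lift + rng.gauss(0, 5)

def compare(users_per_day=1000, seed=2):
    rng = random.Random(seed)

    # Before/after: the feature ships to everyone on day 7.
    before = [daily_revenue(d, False, rng)
              for d in range(0, 7) for _ in range(users_per_day)]
    after = [daily_revenue(d, True, rng)
             for d in range(7, 14) for _ in range(users_per_day)]
    before_after_lift = statistics.mean(after) / statistics.mean(before) - 1

    # AB test: on days 7 to 13, a coin flip decides who gets the feature,
    # so both groups experience the same seasonality.
    treatment, control = [], []
    for d in range(7, 14):
        for _ in range(users_per_day):
            has_feature = rng.random() < 0.5
            group = treatment if has_feature else control
            group.append(daily_revenue(d, has_feature, rng))
    ab_lift = statistics.mean(treatment) / statistics.mean(control) - 1
    return before_after_lift, ab_lift

before_after_lift, ab_lift = compare()
print(f"before/after lift: {before_after_lift:+.1%}")
print(f"AB test lift:      {ab_lift:+.1%}")
```

The before/after comparison reports a lift of about 10% for a feature that does nothing, while the AB comparison stays near zero because treatment and control are measured over the same days.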

AB testing is a powerful measurement

Beyond causality, AB testing is also a powerful measurement tool. Peter Drucker said “If you can’t measure it, you can’t change it.” This is especially true in large companies with a lot of management friction.

Our customer story with Rec Room is a great example. The company did a great UI revamp but saw a 30%+ decrease in their key metric. Without AB testing, they wouldn’t have noticed it.

Product development is not one-time work. It is continuous iteration that accumulates small wins. But you can’t win if you can’t measure wins against losses. Once people start doing AB testing, they find that 70% to 90% of their ideas don’t actually work.

Consequently, people who don’t do AB testing will ship many bad ideas without knowing it.
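To see how this compounds, here is a minimal sketch under loud assumptions: 50 ideas, a 20% win rate (in line with the 70% to 90% failure range above), made-up effect sizes, and an idealized AB test that measures each idea's true effect perfectly. Real tests are noisy, so treat this as an upper bound on what testing buys you. The function name `run_roadmap` is hypothetical.

```python
import random

def run_roadmap(n_ideas=50, win_rate=0.2, seed=3):
    """Compare a team that ships every idea with a team that ships only
    ideas an (idealized, noise-free) AB test shows to be positive."""
    rng = random.Random(seed)
    ship_everything = 100.0      # metric index, starts at 100
    ship_tested_winners = 100.0
    for _ in range(n_ideas):
        if rng.random() < win_rate:
            effect = rng.uniform(0.0, 0.03)    # assumed: a win adds up to 3%
        else:
            effect = rng.uniform(-0.02, 0.0)   # assumed: a loss costs up to 2%
        ship_everything *= 1 + effect
        if effect > 0:                         # the test catches the losers
            ship_tested_winners *= 1 + effect
    return ship_everything, ship_tested_winners

everything, tested = run_roadmap()
print(f"ship everything:     {everything:.1f}")
print(f"ship tested winners: {tested:.1f}")
```

The team that ships everything absorbs every loss along with every win, so its metric can never end above the team that filters ideas through tests.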

[Image: your product metric over time]

In short, AB testing is powerful and important because:

  • Humans are bad at attribution and are subject to lots of biases

  • Humans are bad at predicting the outcome of their ideas

AB testing provides the necessary measurement and causality and keeps us honest with reality.
