Underlying AB testing is the concept of “randomized controlled trials (RCTs).” RCTs are the gold standard for establishing causality.
Below is the famous hierarchy of evidence pyramid. Essentially, the only form of evidence that is stronger than an RCT is a meta-analysis of RCTs. Presenting a well-run RCT in an argument usually settles the argument.
Two technical insights give RCTs their power:
With a large enough sample, randomization cancels out biases. By the law of large numbers, the group averages of every variable, observable or unobservable, converge to the same values in both groups. With a large sample we don’t need to account for those differences ourselves: randomization takes care of it.
With randomized assignments, the difference between the treatment group and the control group is caused by the treatment.
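The first insight can be checked with a small simulation. The sketch below is illustrative only: the trait (a made-up "wealth score" with mean 50 and standard deviation 10), the function names, and the sample sizes are all assumptions, not part of any real study. It randomly splits simulated users into two groups and measures how far apart the groups are on a trait that no treatment touches.

```python
import random
import statistics

def covariate_gap(n, seed=0):
    """Randomly split n simulated users into two halves and return the
    absolute difference in the mean of a pre-existing trait."""
    rng = random.Random(seed)
    # Hypothetical trait (e.g. a wealth score) that exists before any treatment.
    traits = [rng.gauss(50, 10) for _ in range(n)]
    rng.shuffle(traits)  # random assignment: first half vs second half
    half = n // 2
    treatment, control = traits[:half], traits[half:]
    return abs(statistics.mean(treatment) - statistics.mean(control))

def average_gap(n, trials=50):
    """Average the gap over several independent random splits."""
    return statistics.mean(covariate_gap(n, seed=s) for s in range(trials))

if __name__ == "__main__":
    for n in (20, 200, 2000, 20000):
        print(f"n={n:>6}  average gap between groups: {average_gap(n):.3f}")
```

Under these assumptions the gap shrinks roughly with the square root of the sample size, which is why a large randomized sample makes the two groups comparable on traits we never measured.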
“Caused by the treatment” is a very strong statement. In most comparisons that are not RCTs, the difference between two groups is usually the result of selection bias rather than the treatment.
Let’s use one quick example, which also illustrates what “random assignment” is and its importance.
Suppose I claim that I have a magic pill that costs $100 and can increase the height of high school students by 1 inch over a year. I will show you two true results from my study:
Test group: 1000 students who voluntarily took the pill a year ago. Their average height was 60 inches a year ago and 62 inches this year.
Control group: 1000 students from the same schools with the same age. Their average height was 60 inches a year ago and 61 inches this year.
Can we conclude that this pill is effective? We all know that such a magic pill doesn’t exist, but what’s the loophole in this study?
The loophole in this study is “selection bias.” People are (self) selected into the treatment group. Those who volunteer for the study may come from wealthier families, as they can afford the pill, or they may be more eager to grow taller and have tried other things besides taking the pill. Any such factor undermines the causal claim in this study.
But if we take 2000 students and assign the pill randomly, we remove the selection bias. By the law of large numbers, the average metrics (height, wealth, growth of height, eagerness to grow, etc.) of the two groups should be nearly the same, so any difference in their height growth beyond random noise can be attributed to the treatment: the pill.
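To make the contrast concrete, here is a minimal simulation of the pill example. Everything in it is an assumption made for illustration: the pill has zero true effect, growth depends on an unobserved "nutrition" factor, and in the self-selected study the better-nourished students are the ones who opt in. The numbers are not taken from any real data.

```python
import random
import statistics

def simulate_pill_study(n=2000, seed=1):
    """Return (self_selected_gap, randomized_gap): the difference in mean
    yearly growth between pill takers and non-takers under each design."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        nutrition = rng.gauss(0, 1)  # unobserved confounder (hypothetical)
        # Yearly growth in inches. The pill contributes nothing at all.
        growth = 1.5 + 0.5 * nutrition + rng.gauss(0, 0.3)
        students.append((nutrition, growth))

    def gap(takes_pill):
        treated = [g for (_, g), t in zip(students, takes_pill) if t]
        control = [g for (_, g), t in zip(students, takes_pill) if not t]
        return statistics.mean(treated) - statistics.mean(control)

    # Self-selection: better-nourished students opt in to the pill.
    self_selected = [nutrition > 0 for nutrition, _ in students]
    # Random assignment: a coin flip, independent of every trait.
    randomized = [rng.random() < 0.5 for _ in students]
    return gap(self_selected), gap(randomized)

if __name__ == "__main__":
    biased, fair = simulate_pill_study()
    print(f"self-selected study: pill 'adds' {biased:+.2f} inches")
    print(f"randomized study:    pill 'adds' {fair:+.2f} inches")
```

In this sketch the self-selected comparison reports a sizeable "effect" for a pill that does nothing, while the randomized comparison stays close to zero.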
Taking this example to product development, we can see how easy it is to make the same mistake every day without the mindset of AB testing. For example:
Selection bias in time series:
Claim: We shipped a feature and metrics increased 10%
Reality: The metrics would have increased 10% without the feature, for example when a Black Friday banner ships right before Black Friday.
Selection bias in cross sections:
Claim: We shipped a feature, and users who use the feature saw a 10% increase in their metrics
Reality: The users who self-select into using the feature would have seen a 10% increase without the feature, for example when a new button is given to power users (ref: why most aha moments are wrong?)
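The time-series case can be sketched the same way. The simulation below assumes a 10% seasonal lift that applies to everyone and a banner with no effect of its own. The spend distribution and all names are hypothetical. It compares a naive before/after reading with an AB comparison run over the same period.

```python
import random
import statistics

def black_friday_banner(n=4000, seed=2):
    """Return (naive_lift, ab_lift) for a banner that has no real effect."""
    rng = random.Random(seed)
    seasonal_lift = 1.10  # assumed: everyone spends 10% more during sale week
    banner_effect = 1.00  # assumed: the banner itself changes nothing

    before = [rng.gauss(100, 15) for _ in range(n)]    # spend before the launch
    saw_banner = [rng.random() < 0.5 for _ in range(n)]  # random assignment
    after = [
        b * seasonal_lift * (banner_effect if s else 1.0) * rng.gauss(1, 0.05)
        for b, s in zip(before, saw_banner)
    ]

    before_treated = statistics.mean(b for b, s in zip(before, saw_banner) if s)
    after_treated = statistics.mean(a for a, s in zip(after, saw_banner) if s)
    after_control = statistics.mean(a for a, s in zip(after, saw_banner) if not s)

    naive_lift = after_treated / before_treated - 1  # before vs after
    ab_lift = after_treated / after_control - 1      # treatment vs control
    return naive_lift, ab_lift

if __name__ == "__main__":
    naive, ab = black_friday_banner()
    print(f"before/after reading: {naive:+.1%}")
    print(f"AB test reading:      {ab:+.1%}")
```

Because the control group experiences the same seasonality, the AB comparison cancels it out. The before/after comparison credits the banner with the entire seasonal lift.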
Beyond causality, AB testing is also a powerful measurement tool. A saying often attributed to Peter Drucker goes, “If you can’t measure it, you can’t change it.” This is especially true in large companies with a lot of management friction.
Our customer story with Recroom is a great example. The company did a great UI revamp but saw a 30%+ decrease in their key metric. Without AB testing, they wouldn’t have noticed it.
Product development is not one-time work. It is a continuous iteration that accumulates small wins. But you can’t win if you can’t measure wins against losses. Once people start doing AB testing, they find out that 70% to 90% of their ideas actually don’t work.
Consequently, people who don’t do AB testing will ship many bad ideas without knowing it.
In short, AB testing is powerful and important because:
Humans are bad at attributions and are subject to lots of biases
Humans are bad at predicting the outcome of their ideas
AB testing provides the necessary measurement and causality and keeps us honest with reality.