We play twice, tails comes up twice, and you owe me $20. You’ll probably chalk this up to bad luck; after all, there’s a 25% chance a fair coin will produce this result. So you decide to play 8 more times and get 8 more tails. That’s 10 tails out of 10 flips, you now owe me $100, and I’m grinning ear-to-ear… are you suspicious yet? You should be: the chance of this happening with a fair coin is less than 1 in a thousand (<0.1%).
Somewhere between 2 and 10 coin flips is a point where you should call bullshit. I recommend picking a high threshold so you don’t use foul words due to everyday bad luck. But you don’t want it to be too high because you’re not a sucker. I suggest you call me out if the outcome has a less than 1 in 20 chance of occurring (<5%). This means if you get 4 tails out of 4 (a 6% chance), you chalk it up to bad luck. If you get 5 tails out of 5 (a 3% chance), you decide you were cheated and call bullshit.
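The arithmetic behind these numbers is short enough to check yourself. Here is a minimal sketch, assuming a fair coin and a run of nothing but tails; the function names are made up for illustration.

```python
def prob_all_tails(n_flips: int, p_tails: float = 0.5) -> float:
    """Chance that every one of n_flips lands tails."""
    return p_tails ** n_flips


def flips_needed_to_call_it(threshold: float = 0.05) -> int:
    """Smallest all-tails streak whose chance under a fair coin is below the threshold."""
    n = 1
    while prob_all_tails(n) >= threshold:
        n += 1
    return n


for n in (2, 4, 5, 10):
    print(f"{n} tails in a row: {prob_all_tails(n):.2%}")

print("Call it after this many straight tails:", flips_needed_to_call_it())
```

With the 1 in 20 threshold, the streak length that tips you over is 5, matching the 4-versus-5 comparison above.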
Congrats, you now understand Frequentist hypothesis testing! You assumed the coin was fair (the null hypothesis), and only when you ended up with a result below a reasonable threshold did you call bullshit (5 tails out of 5 flips, <5%). You rejected the null hypothesis, meaning you accepted the alternative hypothesis that the coin was biased.
Congrats! You’ve just learned hypothesis testing for $50.
This is the most common misconception around p-values, confidence intervals, and hypothesis testing. Hypothesis testing does not tell us the probability we made the right decision; we simply don’t know. To know this requires information like: did the coin come from your pocket or mine? Was I just inside a magic shop? Do I have a large stack of money I’ve won from other people? While these answers should affect your estimate of the chances the coin is unfair, it’s really hard to quantify them objectively. Instead, hypothesis testing ONLY tells us that the result is odd when we assume the coin is fair.
This is directly applicable to AB testing… we don’t know the probability that a test will work and guessing only introduces bias. Instead we assume there will be no effect, and only if we see an unlikely result will we make a big deal of it. The cool thing about hypothesis testing is it’s unbiased, and doesn’t require us to estimate the chance of success (which can be a highly subjective process).
We have the confusing definition of p-values and significance to blame for this. A p-value of 0.05 means that the result (and anything at least as extreme) has a 5% chance of occurring under the null hypothesis. In our example, we’re stating that the outcome would have a <5% chance of occurring IF the coin is fair. The 5% threshold is also called the false positive rate, and it is something we do know and can control, but it’s not the same as knowing the chance we’re wrong.
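To make "the result and anything at least as extreme" concrete, here is a small sketch of a p-value for the coin game. It assumes a one-sided test (only too many tails counts as extreme, since that is what costs you money), and `p_value_tails` is an illustrative name, not a library function.

```python
from math import comb


def p_value_tails(k_tails: int, n_flips: int, p_null: float = 0.5) -> float:
    """One-sided p-value: chance of k_tails or more tails in n_flips,
    computed under the null hypothesis that P(tails) = p_null."""
    return sum(
        comb(n_flips, k) * p_null**k * (1 - p_null) ** (n_flips - k)
        for k in range(k_tails, n_flips + 1)
    )


print(f"5 tails out of 5: p = {p_value_tails(5, 5):.4f}")
print(f"4 tails out of 5: p = {p_value_tails(4, 5):.4f}")
```

Note that every number here is computed assuming the coin is fair. Nothing in the calculation says how likely the coin is to actually be fair.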
We know that the outcome is unlikely if the coin was fair, so we concluded it must not be fair. But we don’t know how the coin truly behaves: Does it have two tails? Or is it only 60% biased? We were only able to reject the null hypothesis and conclude that the coin isn’t fair. It’s somewhat standard practice to accept the observed result (5 times out of 5 = 100%), with some margin of error as our best guess of the coin’s behavior (after rejecting the null hypothesis). But the truth is that many different degrees of biased coins could have easily produced this result.
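As a rough illustration of that last point, the sketch below computes how often coins with different biases would produce 5 tails out of 5. The bias values are arbitrary examples chosen for illustration.

```python
def chance_of_all_tails(p_tails: float, n_flips: int = 5) -> float:
    """Chance that a coin with P(tails) = p_tails lands tails on every flip."""
    return p_tails ** n_flips


# Hypothetical coins, from fair to two-tailed.
for p_tails in (0.5, 0.6, 0.8, 1.0):
    chance = chance_of_all_tails(p_tails)
    print(f"P(tails) = {p_tails:.0%}: 5 out of 5 happens {chance:.1%} of the time")
```

A two-tailed coin always produces this result, but an 80% coin does so about a third of the time, so 5 flips alone cannot tell those coins apart.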
This misconception largely originates from AB testing leaders like Microsoft, Google, and Facebook who talk a lot about experimentation on hundreds of millions of users. Larger samples do tend to give better tests. But statistical power is more than just sample size; it also depends on effect size. Small companies almost always see big effect sizes, giving them MORE statistical power than large companies (see You Don’t Need Large Sample Sizes to Run A/B Tests). Many scientific studies are based on small sample sizes (< 20). The coin flip example required only 5 flips. The whole point of statistics is to identify whether a result is plausibly due to signal or to noise; a small sample size has already been accounted for.
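The trade-off between sample size and effect size can be sketched with the same coin game. The code below assumes a one-sided test at the 5% level and computes exact binomial power; the helper names and the specific coins compared are illustrative choices.

```python
from math import comb


def tail_prob(k_min: int, n: int, p: float) -> float:
    """Chance of k_min or more tails in n flips when P(tails) = p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def critical_tails(n: int, alpha: float = 0.05) -> int:
    """Fewest tails that a fair coin would produce less than alpha of the time."""
    for k in range(n + 1):
        if tail_prob(k, n, 0.5) < alpha:
            return k
    return n + 1  # no outcome is extreme enough to reject with this few flips


def power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Chance the test calls foul when the coin's real P(tails) is p_true."""
    return tail_prob(critical_tails(n, alpha), n, p_true)


print(f"5 flips, 90% tails coin:   power = {power(5, 0.90):.2f}")
print(f"100 flips, 55% tails coin: power = {power(100, 0.55):.2f}")
```

In this sketch a heavily biased coin is caught more often in 5 flips than a slightly biased coin is in 100, which is the sense in which a big effect can outweigh a big sample.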
Some readers will call me out on the peeking problem, which I ignored for simplicity. In a nutshell, the number of times you peek at or reevaluate your results should affect your statistics. One correct approach is to pick a fixed number of flips before you start and make your decision only once you reach it (this is called a fixed horizon test).
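To see why peeking matters, here is a sketch that computes the exact chance of wrongly accusing a fair coin when you re-run the test after every single flip and stop the first time it dips below 5%. It assumes a one-sided test and a peek after every flip; both are simplifications for illustration.

```python
from math import comb


def p_value_tails(k_tails: int, n_flips: int) -> float:
    """One-sided p-value for k_tails or more tails in n_flips of a fair coin."""
    return sum(comb(n_flips, k) for k in range(k_tails, n_flips + 1)) / 2**n_flips


def false_positive_rate_with_peeking(max_flips: int, alpha: float = 0.05) -> float:
    """Chance a fair coin gets accused if we test after every flip up to max_flips."""
    alive = {0: 1.0}  # tails so far -> probability of being here without an accusation
    accused = 0.0
    for n in range(1, max_flips + 1):
        after_flip = {}
        for tails, prob in alive.items():
            for new_tails in (tails, tails + 1):  # heads or tails, 50/50
                after_flip[new_tails] = after_flip.get(new_tails, 0.0) + prob * 0.5
        alive = {}
        for tails, prob in after_flip.items():
            if p_value_tails(tails, n) < alpha:
                accused += prob  # we peeked, saw "significance", and stopped
            else:
                alive[tails] = prob
    return accused


print(f"Decide once, after 5 flips:    {p_value_tails(5, 5):.4f}")
print(f"Peek after every flip, max 10: {false_positive_rate_with_peeking(10):.4f}")
print(f"Peek after every flip, max 20: {false_positive_rate_with_peeking(20):.4f}")
```

Each individual peek uses the same 5% threshold, yet the overall chance of a false accusation keeps growing with the number of peeks, which is why the number of flips should be fixed up front.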
The smart folks at the Netflix experiment team wrote a more thorough and statistically rigorous explainer using coin flips in their blog post, Interpreting A/B test results: false positives and statistical significance. Be sure to check it out.