Confused about p-values and hypothesis testing? Let’s play a game.

Timothy Chan
Mon Apr 18 2022
AB-TESTING HYPOTHESIS-TESTING P-VALUE CONFIDENCE-INTERVAL STATISTICS
Photo by ZSun Fu on Unsplash

You get to flip a coin: if it’s heads, you win $10; if it’s tails, I win $10. We play twice, tails comes up twice, and you owe me $20. You’ll probably chalk this up to bad luck; after all, there’s a 25% chance a fair coin produces this result. So you decide to play 8 more times and get 8 more tails. That’s 10 tails out of 10 flips; you now owe me $100 and I’m grinning ear to ear… are you suspicious yet? You should be: the chance of this happening with a fair coin is less than 1 in a thousand (<0.1%).
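The arithmetic behind those odds is just repeated halving; a minimal sketch:

```python
# Chance of an all-tails run with a fair coin: flips are independent
# 50/50 events, so k tails in a row has probability 0.5**k.
def p_all_tails(k: int) -> float:
    return 0.5 ** k

print(p_all_tails(2))   # 0.25 -> the 25% chance after two flips
print(p_all_tails(10))  # 0.0009765625 -> less than 0.1% after ten flips
```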

Somewhere between 2 and 10 coin flips is a point where you should call bullshit. I recommend picking a high threshold so you don’t use foul words over everyday bad luck, but you don’t want it too high either, because you’re not a sucker. I suggest you call me out if the outcome has less than a 1-in-20 chance of occurring (<5%). This means if you get 4 tails out of 4 (a 6% chance), you chalk it up to bad luck; if you get 5 tails out of 5 (a 3% chance), you decide you were cheated and call bullshit.
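You can find where the 5% threshold first kicks in by searching for the smallest all-tails run that falls below it (5% playing the role of a significance level here):

```python
# Find the fewest consecutive tails whose probability under a fair
# coin drops below the 5% "call bullshit" threshold.
alpha = 0.05
n = 1
while 0.5 ** n >= alpha:
    n += 1
print(n)         # 5: four tails (6.25%) is still bad luck, five is not
print(0.5 ** n)  # 0.03125
```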

You now understand Frequentist hypothesis testing! You assumed the coin was fair (the null hypothesis), and only when you ended up with a result below a reasonable threshold did you call bullshit (5 tails out of 5 flips, <5%). You rejected the null hypothesis, meaning you accept the alternate hypothesis that the coin is biased.

Congrats! You’ve just learned hypothesis testing for $50.

Major Misconceptions to Watch Out For

1. “There is a 95% chance the coin is bad.”

This is the most common misconception around p-values, confidence intervals, and hypothesis testing. Hypothesis testing does not tell us the probability we made the right decision; we simply don’t know. Knowing this would require information like: did the coin come from your pocket or mine? Was I just inside a magic shop? Do I have a large stack of money I’ve won from other people? While these answers should affect your estimate of the chance the coin is unfair, it’s really hard to quantify objectively. Instead, hypothesis testing ONLY tells us that the result is odd if we assume the coin is fair.

This is directly applicable to A/B testing: we don’t know the probability that a test will work, and guessing only introduces bias. Instead we assume there will be no effect, and only if we see an unlikely result do we make a big deal of it. The cool thing about hypothesis testing is that it’s unbiased and doesn’t require us to estimate the chance of success (which can be a highly subjective process).

2. “There is a 5% chance we’re wrong”

We have the confusing definition of p-values and significance to blame for this. A p-value of 0.05 means that the result (and anything as extreme) has a 5% chance of occurring under the null hypothesis. In our example, we’re stating that the outcome has a <5% chance of occurring IF the coin is fair. This 5% threshold is also called the false positive rate, and it is something we do know and can control, but it’s not the same as knowing the chance we’re wrong.
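A quick simulation (using the 5-tails-out-of-5 decision rule from earlier) shows what the false positive rate means in practice: play many honest games with a genuinely fair coin and the rule cries foul at a rate we can predict and control.

```python
import random

random.seed(0)

# Simulate many honest games with a FAIR coin; count how often the
# "5 tails out of 5" rule wrongly calls bullshit. That rate is the
# false positive rate, controlled below the 5% threshold.
trials = 100_000
false_positives = sum(
    all(random.random() < 0.5 for _ in range(5))  # all 5 flips land tails
    for _ in range(trials)
)
print(false_positives / trials)  # close to 0.5**5 = 0.03125, under 0.05
```

Note that this says nothing about whether any single “bullshit” call was right; it only bounds how often we falsely accuse a fair coin.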

3. “We know how bad the coin is.”

We know that the outcome is unlikely if the coin is fair, so we concluded it must not be fair. But we don’t know how the coin truly behaves: does it have two tails? Or is it only 60% biased toward tails? We were only able to reject the null hypothesis and conclude that the coin isn’t fair. It’s somewhat standard practice to accept the observed result (5 tails out of 5 = 100%), with some margin of error, as our best guess of the coin’s behavior (after rejecting the null hypothesis). But the truth is that many different degrees of biased coins could easily have produced this result.
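To see why the observed 100% tails doesn’t pin down the coin’s true bias, compare how easily coins of different biases produce 5 tails in a row (the bias values here are illustrative):

```python
# Probability that a coin with the given tails-probability produces
# 5 tails out of 5: many distinct biases make this result plausible.
for tails_prob in (0.6, 0.8, 1.0):
    print(tails_prob, tails_prob ** 5)
# 0.6 -> ~7.8%, 0.8 -> ~32.8%, 1.0 -> 100%
```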

4. “This isn’t trustworthy, we need a larger sample size.”

This misconception largely originates from A/B-testing leaders like Microsoft, Google, and Facebook, who talk a lot about experimentation on hundreds of millions of users. Larger samples do tend to give better tests. But statistical power is about more than sample size; it also depends on effect size. Small companies almost always see big effect sizes, giving them MORE statistical power than large companies (see You Don’t Need Large Sample Sizes to Run A/B Tests). Many scientific studies are based on small sample sizes (<20). The coin-flip example required only 5 flips. The whole point of statistics is to identify which results are plausibly signal versus noise; a small sample size has already been accounted for.
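The interplay of effect size and sample size can be made concrete with the same decision rule: the power of “call bullshit on 5 tails out of 5” depends entirely on how biased the coin actually is (the bias values below are illustrative):

```python
# Power of the "reject on 5 tails out of 5" rule: the probability a
# biased coin actually triggers a rejection.
def power(tails_prob: float, n: int = 5) -> float:
    return tails_prob ** n

print(power(1.0))  # 1.0 -> a double-tailed coin is caught every time
print(power(0.9))  # ~0.59 -> a heavily biased coin is usually caught
print(power(0.6))  # ~0.078 -> a slightly biased coin needs far more flips
```

A huge effect (a two-tailed coin) gives full power with just 5 flips; a subtle effect would need a much larger sample, which is exactly the trade-off big companies face.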

Statistical Aside: What About Peeking?

Some readers will call me out on the peeking problem, which I ignored for simplicity. In a nutshell, every time you peek at or reevaluate your results, it should affect your statistics. One correct approach is to pick a fixed number of flips before you start and only make a decision once you reach it (this is called a fixed-horizon test).
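A small simulation (my own sketch, not a fixed-horizon test) shows why peeking matters: test a fair coin after every flip, stop the first time anything looks “significant,” and the false positive rate climbs well above the nominal 5%.

```python
import math
import random

random.seed(1)

# Flip a FAIR coin up to 100 times; after each flip (from 10 onward,
# where the normal approximation is reasonable) run a two-sided z-test
# on the tails count and stop at the first "significant" peek.
def peeking_rejects(max_flips: int = 100, z: float = 1.96) -> bool:
    tails = 0
    for n in range(1, max_flips + 1):
        tails += random.random() < 0.5
        se = math.sqrt(n * 0.25)  # std dev of tails count under fairness
        if n >= 10 and abs(tails - n / 2) > z * se:
            return True           # called bullshit at this peek
    return False

trials = 2_000
rate = sum(peeking_rejects() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05
```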

The smart folks on the Netflix experimentation team wrote a more thorough and statistically rigorous explainer using coin flips in their blog post Interpreting A/B test results: false positives and statistical significance. Be sure to check it out.


