Confused about p-values and hypothesis testing? Let’s play a game.

Timothy Chan
Mon Apr 18 2022
AB-TESTING HYPOTHESIS-TESTING P-VALUE CONFIDENCE-INTERVAL STATISTICS
Photo by ZSun Fu on Unsplash

You get to flip a coin: if it’s heads, you win $10; if it’s tails, I win $10. We play twice, tails comes up twice, and you owe me $20. You’ll probably chalk this up to bad luck; after all, there’s a 25% chance a fair coin produces this result. So you decide to play 8 more times and get 8 more tails. That’s 10 tails out of 10 flips; you now owe me $100 and I’m grinning ear to ear… are you suspicious yet? You should be: the chance of this happening with a fair coin is less than 1 in a thousand (<0.1%).
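The arithmetic behind those odds is just repeated halving; a minimal sketch:

```python
# Chance of an all-tails run with a fair coin: flips are independent
# 50/50 events, so k tails in a row has probability 0.5**k.
def p_all_tails(k: int) -> float:
    return 0.5 ** k

print(p_all_tails(2))   # 0.25 -> the 25% chance after two flips
print(p_all_tails(10))  # 0.0009765625 -> less than 0.1% after ten flips
```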

Somewhere between 2 and 10 coin flips is a point where you should call bullshit. I recommend picking a high threshold so you don’t use foul words over everyday bad luck, but you don’t want it too high either, because you’re not a sucker. I suggest you call me out if the outcome has less than a 1-in-20 chance of occurring (<5%). This means if you get 4 tails out of 4 (a 6% chance), you chalk it up to bad luck; if you get 5 tails out of 5 (a 3% chance), you decide you were cheated and call bullshit.
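You can find where the 5% threshold first kicks in by searching for the smallest all-tails run that falls below it (5% playing the role of a significance level here):

```python
# Find the fewest consecutive tails whose probability under a fair
# coin drops below the 5% "call bullshit" threshold.
alpha = 0.05
n = 1
while 0.5 ** n >= alpha:
    n += 1
print(n)         # 5: four tails (6.25%) is still bad luck, five is not
print(0.5 ** n)  # 0.03125
```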

You now understand Frequentist hypothesis testing! You assumed the coin was fair (the null hypothesis), and only when you ended up with a result below a reasonable threshold did you call bullshit (5 tails out of 5 flips, <5%). You rejected the null hypothesis, meaning you accept the alternate hypothesis that the coin is biased.

Congrats! You’ve just learned hypothesis testing for $50.

Major Misconceptions to Watch Out For

1. “There is a 95% chance the coin is bad.”

This is the most common misconception around p-values, confidence intervals, and hypothesis testing. Hypothesis testing does not tell us the probability we made the right decision; we simply don’t know. Knowing this would require information like: did the coin come from your pocket or mine? Was I just inside a magic shop? Do I have a large stack of money I’ve won from other people? While these answers should affect your estimate of the chance the coin is unfair, it’s really hard to quantify objectively. Instead, hypothesis testing ONLY tells us that the result is odd if we assume the coin is fair.

This is directly applicable to A/B testing: we don’t know the probability that a test will work, and guessing only introduces bias. Instead we assume there will be no effect, and only if we see an unlikely result do we make a big deal of it. The cool thing about hypothesis testing is that it’s unbiased and doesn’t require us to estimate the chance of success (which can be a highly subjective process).

2. “There is a 5% chance we’re wrong”

We have the confusing definition of p-values and significance to blame for this. A p-value of 0.05 means that the result (and anything as extreme) has a 5% chance of occurring under the null hypothesis. In our example, we’re stating that the outcome has a <5% chance of occurring IF the coin is fair. This 5% threshold is also called the false positive rate, and it is something we do know and can control, but it’s not the same as knowing the chance we’re wrong.
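A quick simulation (using the 5-tails-out-of-5 decision rule from earlier) shows what the false positive rate means in practice: play many honest games with a genuinely fair coin and the rule cries foul at a rate we can predict and control.

```python
import random

random.seed(0)

# Simulate many honest games with a FAIR coin; count how often the
# "5 tails out of 5" rule wrongly calls bullshit. That rate is the
# false positive rate, controlled below the 5% threshold.
trials = 100_000
false_positives = sum(
    all(random.random() < 0.5 for _ in range(5))  # all 5 flips land tails
    for _ in range(trials)
)
print(false_positives / trials)  # close to 0.5**5 = 0.03125, under 0.05
```

Note that this says nothing about whether any single “bullshit” call was right; it only bounds how often we falsely accuse a fair coin.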

3. “We know how bad the coin is.”

We know that the outcome is unlikely if the coin is fair, so we concluded it must not be fair. But we don’t know how the coin truly behaves: does it have two tails? Or is it only 60% biased toward tails? We were only able to reject the null hypothesis and conclude that the coin isn’t fair. It’s somewhat standard practice to accept the observed result (5 tails out of 5 = 100%), with some margin of error, as our best guess of the coin’s behavior (after rejecting the null hypothesis). But the truth is that many different degrees of biased coins could easily have produced this result.
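To see why the observed 100% tails doesn’t pin down the coin’s true bias, compare how easily coins of different biases produce 5 tails in a row (the bias values here are illustrative):

```python
# Probability that a coin with the given tails-probability produces
# 5 tails out of 5: many distinct biases make this result plausible.
for tails_prob in (0.6, 0.8, 1.0):
    print(tails_prob, tails_prob ** 5)
# 0.6 -> ~7.8%, 0.8 -> ~32.8%, 1.0 -> 100%
```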

4. “This isn’t trustworthy, we need a larger sample size.”

This misconception largely originates from A/B-testing leaders like Microsoft, Google, and Facebook, who talk a lot about experimentation on hundreds of millions of users. Larger samples do tend to give better tests. But statistical power is about more than sample size; it also depends on effect size. Small companies almost always see big effect sizes, giving them MORE statistical power than large companies (see You Don’t Need Large Sample Sizes to Run A/B Tests). Many scientific studies are based on small sample sizes (<20). The coin-flip example required only 5 flips. The whole point of statistics is to identify which results are plausibly signal versus noise; a small sample size has already been accounted for.
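The interplay of effect size and sample size can be made concrete with the same decision rule: the power of “call bullshit on 5 tails out of 5” depends entirely on how biased the coin actually is (the bias values below are illustrative):

```python
# Power of the "reject on 5 tails out of 5" rule: the probability a
# biased coin actually triggers a rejection.
def power(tails_prob: float, n: int = 5) -> float:
    return tails_prob ** n

print(power(1.0))  # 1.0 -> a double-tailed coin is caught every time
print(power(0.9))  # ~0.59 -> a heavily biased coin is usually caught
print(power(0.6))  # ~0.078 -> a slightly biased coin needs far more flips
```

A huge effect (a two-tailed coin) gives full power with just 5 flips; a subtle effect would need a much larger sample, which is exactly the trade-off big companies face.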

Statistical Aside: What About Peeking?

Some readers will call me out on the peeking problem, which I ignored for simplicity. In a nutshell, every time you peek at or reevaluate your results, it should affect your statistics. One correct approach is to pick a fixed number of flips before you start and only make a decision once you reach it (this is called a fixed-horizon test).
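A small simulation (my own sketch, not a fixed-horizon test) shows why peeking matters: test a fair coin after every flip, stop the first time anything looks “significant,” and the false positive rate climbs well above the nominal 5%.

```python
import math
import random

random.seed(1)

# Flip a FAIR coin up to 100 times; after each flip (from 10 onward,
# where the normal approximation is reasonable) run a two-sided z-test
# on the tails count and stop at the first "significant" peek.
def peeking_rejects(max_flips: int = 100, z: float = 1.96) -> bool:
    tails = 0
    for n in range(1, max_flips + 1):
        tails += random.random() < 0.5
        se = math.sqrt(n * 0.25)  # std dev of tails count under fairness
        if n >= 10 and abs(tails - n / 2) > z * se:
            return True           # called bullshit at this peek
    return False

trials = 2_000
rate = sum(peeking_rejects() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05
```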

The smart folks on the Netflix experimentation team wrote a more thorough and statistically rigorous explainer using coin flips in their blog post Interpreting A/B test results: false positives and statistical significance. Be sure to check it out.


