Calculating Sample Sizes for A/B Tests

Timothy Chan
Fri Apr 02 2021
DATA-SCIENCE AB-TESTING STATISTICS HYPOTHESIS-TESTING

After being unsatisfied with various online Sample Size calculators, we decided to build our own: https://www.statsig.com/calculator. Unlike other calculators, ours can handle different group proportions (e.g. 20/80 ratios) and we’re excited to share our methodology in this blog post.

We hope that sharing our calculations also solves two broader problems:

  • Transparency: Online calculators give different answers and do not explain why. This makes them hard to evaluate. As a data scientist, I couldn’t stake my reputation on any of these.
  • Education: I struggled to find a solid explanation of how to calculate sample sizes from first principles. To me, most online resources failed to explain which cases they cover and how to use their formulas (e.g. does σ mean standard error or standard deviation? Is this for the test or control, or is it pooled?). They also only considered proportion metrics (e.g. click-through rates), ignoring many other important metrics (e.g. time spent and payment amount).

For me, the final straw was not finding any solution for calculating sample sizes for imbalanced tests (e.g. 20% test and 80% control), which are becoming the norm. Deriving these equations was fun and I’m happy to share them here.

Statsig’s A/B Sample Size Calculator (https://statsig.com/calculator)

Why Calculating Sample Sizes is Critical

Calculating the required sample size for an A/B test (also known as a split test or bucket test) helps you run a properly powered experiment. Just like you would check your gas tank before a road trip, you want to check your sample sizes before launching an experiment. Too few samples (users) and you’re unlikely to detect the effect. Too many samples and you risk exposing a bad test to too many users for too long while slowing your team’s pace.

Setting Your Test Up for Success

As A/B testing is subject to randomness, we need to limit two types of errors:

1. Type I errors are the chance that, given no experimental effect, we still detect a statistically significant difference through randomness alone. This is referred to as the significance level (⍺) and is commonly set to 0.05 (a 5% chance).

2. Type II errors are the chance (β) that a real effect (of size MDE) won’t produce statistically significant results. Power (1-β) is the chance that a real effect will produce significant results. We commonly set power to 0.8 (80%), which means β = 0.2.

Minimum Detectable Effect (MDE)

MDE is the smallest effect you want to observe in an experiment. While any positive effect is good news, smaller effects are harder to measure and require more samples and time. Lengthy and large tests with small effects are generally not worth running. Knowing this helps you set a reasonable MDE and determine whether an experiment is worthwhile.

Calculating Sample Sizes

The standard framework for calculating sample sizes considers two distributions of the measured difference between test and control: one centered at H0 (no effect) and one centered at the MDE. Requiring that the significance cutoff under H0 is also the point at which we reach our desired power under the MDE leads to the following formula:

H0 + Z(1-⍺) · SE(H0) = MDE − Z(1-β) · SE(MDE)

Where:

  • H0 is the mean result we expect the control and test groups to produce if there’s no experimental effect.
  • MDE is the minimum detectable effect, or what we expect (at a minimum) the test group to show.
  • Z(1-⍺) is the number of standard errors above H0 (assuming no effect) at which we set the significance threshold. I prefer one-sided tests as they give actionable results, but if you are running a two-sided test, use Z(1-⍺/2) instead (the sketch after this list shows typical values).
  • Z(1-β) is the number of standard errors from MDE (given an effect). This is our safety buffer to minimize Type II errors.
  • SE(H0) is the standard error of the difference between test and control in the situation that there is no difference.
  • SE(MDE) is the standard error of the difference between test and control when the test group has the minimum detectable effect.
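
If you want to see where these Z values come from, here’s a minimal Python sketch (assuming scipy is available) that converts ⍺ and power into the corresponding number of standard errors:

```python
from scipy.stats import norm

alpha = 0.05    # significance level (one-sided test)
power = 0.80    # 1 - beta

z_alpha = norm.ppf(1 - alpha)   # Z(1-alpha), roughly 1.645
z_beta = norm.ppf(power)        # Z(1-beta), roughly 0.842
```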

Solving this requires working out two different standard errors, SE(H0) and SE(MDE). Each standard error reflects the distribution of a comparison (between A and B). Whenever we compare two groups, the resulting variance is the sum of each group’s variance. SE(H0) is the standard error under the null hypothesis, H0 (no effect). Since there’s no effect, both group A and group B have the same standard deviation, σ₀, which gives:

SE(H0) = √(σ₀²/nA + σ₀²/nB) = σ₀ · √(1/nA + 1/nB)

SE(MDE) is the standard error of the minimum detectable effect. The test group under the MDE likely has a different standard deviation, but it’s mathematically convenient to assume it is the same. This simplification is fairly accurate for small effects, which is when calculating sample sizes matters most (large effects require fewer samples and don’t generally have power concerns). In the end, this leads to SE(H0) ≈ SE(MDE).

I’ll save you the tedious algebra, but further simplification produces:

MDE − H0 = (Z(1-⍺) + Z(1-β)) · σ₀ · √(1/nA + 1/nB)

We have an estimate for every term here except the radical expression with nA and nB. Solving this directly would only give a relationship between nA and nB. What’s more useful is specifying a split ratio (r). For the canonical 50/50 test, r = 0.5. But it’s common for product teams to ship a feature to just 10% of users when they want to be cautious (r = 0.1), or to run a 90/10 test when they want to broadly ship a feature but still measure its effect (r = 0.9). We can replace nA and nB with the ratio (r) and total samples (nTotal = nA + nB) by setting nA = r · nTotal and nB = (1 − r) · nTotal, which gives:

1/nA + 1/nB = 1/(r · nTotal) + 1/((1 − r) · nTotal) = 1/(r · (1 − r) · nTotal)

Solving for nTotal produces our final answer:

nTotal = (Z(1-⍺) + Z(1-β))² · σ₀² / (r · (1 − r) · (MDE − H0)²)
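
To make this concrete: suppose we run a one-sided test with ⍺ = 0.05 (Z(1-⍺) ≈ 1.645) and 80% power (Z(1-β) ≈ 0.842), with a 50/50 split (r = 0.5), a baseline click-through rate of 10% (σ₀² = 0.1 · 0.9 = 0.09), and a minimum detectable lift of one percentage point (MDE − H0 = 0.01). Plugging in, nTotal ≈ (1.645 + 0.842)² · 0.09 / (0.25 · 0.01²) ≈ 22,000 users in total.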

This formula has many advantages over what you may find elsewhere:

  • You can calculate sample sizes for A/B tests with unequal test vs control groups. Most online calculators assume you want to run a 50/50 test, which is becoming less common.
  • You can provide your own standard deviation. Most online calculators assume you have a proportion metric (like click-through rate or payer ratios) because it’s common and also makes the calculation easier. The above formula works for non-Bernoulli metrics like time spent or average payment amount.

If you have a proportion metric, go ahead and use:

σ₀ = √(p · (1 − p))

where p is your baseline proportion (e.g. the control group’s click-through rate).

Otherwise, you can estimate standard deviation from your current data.
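
To tie everything together, here’s a minimal Python sketch of the final formula (this is just the equation above expressed in code, assuming scipy is available; `sample_size` is an illustrative helper, not the calculator’s actual implementation):

```python
from scipy.stats import norm


def sample_size(mde_abs, sigma, alpha=0.05, power=0.80, r=0.5, one_sided=True):
    """Total samples needed to detect an absolute lift of `mde_abs` (MDE - H0)
    on a metric with standard deviation `sigma`, where a fraction `r` of users
    lands in one group and `1 - r` in the other."""
    z_alpha = norm.ppf(1 - alpha) if one_sided else norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2 * sigma ** 2 / (r * (1 - r) * mde_abs ** 2)


# Proportion metric: 10% baseline click-through rate, detect a 1-point lift.
p = 0.10
sigma_0 = (p * (1 - p)) ** 0.5            # sigma_0 = sqrt(p * (1 - p))

print(sample_size(0.01, sigma_0, r=0.5))  # 50/50 split -> roughly 22,000
print(sample_size(0.01, sigma_0, r=0.2))  # 20/80 split -> roughly 35,000
```

Note how the 20/80 split needs about 56% more total samples than the 50/50 split for the same MDE, because r · (1 − r) drops from 0.25 to 0.16. That’s exactly the trade-off the split ratio term captures.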

Interested in automating the set-up and analysis of your A/B tests? Check us out at https://www.statsig.com or feel free to contact me at Tim@statsig.com. Let me know if you’ve found this useful or have any questions about experimentation. May all your tests be properly powered.


