You Don’t Need Large Sample Sizes to Run A/B Tests

Timothy Chan
Fri Aug 13 2021
EXPERIMENTATION AB-TESTING SOFTWARE-DEVELOPMENT DATA-SCIENCE

Small Companies’ Secret Advantage in Experimentation


A/B testing is a well-established and powerful product development tool that has become best practice among big tech companies. Yet many small and medium-sized companies aren’t running A/B tests. When asked why, they say, “We aren’t Facebook/Google/Amazon; we just don’t have enough users.” Sadly, this oversimplification and misunderstanding of statistics is blocking a lot of companies from even trying the industry’s most powerful product growth tool.

Companies like Facebook, Microsoft, Airbnb, Spotify, and Netflix are the face of rapid product growth through experimentation. They run thousands of simultaneous experiments on millions of users, with microscopic wins that are worth big money (see Bing’s Shades of Blue Experiment). To the rest of us, A/B testing can look like an impressive but unrelatable academic exercise that has little to do with the world of startups.

What’s often ignored is that while big companies use A/B testing to optimize billion-user products like Google Search and Facebook Newsfeed, they use the same tools to build zero-to-one products, starting with as few as a thousand users and growing that into millions. I spent 5 years as a Data Scientist at Facebook and oversaw A/B tests on a wide range of features and products, helping launch and grow many new products, each starting with tests on mere “thousands” of users. A/B testing is a pillar of data-driven product development irrespective of the product’s size. And despite what you may hear, small products actually have a huge statistical advantage that’s rarely discussed.


Startups’ Secret Weapon in A/B Testing

Startup companies have a major advantage in experimentation: huge potential and upside. Startups don’t play the micro-optimization game; they don’t care about a +0.2% increase in click-through rates. Instead, they hunt for big wins like a 40% increase in feature adoption, or a 15% increase in signup rates. Startups have a lot of low-hanging fruit and huge opportunities. The statistical term is “effect size” and it’s actually more important than sample size in determining an experiment’s statistical power.

Data generated for a standard A/B test over 7 days on a 5% baseline conversion rate metric using a one-sided t-test (5% significance level and 80% power).

The above chart shows experiments with equal statistical power. Your results will vary depending on your specific experiment, but it’s clear you do not need millions of users to measure meaningful metric lifts as small as 5%. This doesn’t even include the multitude of tricks you can use to boost statistical power (Bayesian stats, CUPED variance reduction, or extending the duration of your test).
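To make the trade-off between effect size and sample size concrete, here is a minimal sketch that estimates the users needed per group for a few relative lifts on a 5% baseline conversion rate. It assumes a one-sided test at a 5% significance level with 80% power and uses the standard normal approximation for comparing two proportions; the function name `users_per_group` is hypothetical, and this is not necessarily how the chart above was generated.

```python
import math
from statistics import NormalDist


def users_per_group(baseline: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each group for a one-sided test.

    Assumption of this sketch: normal approximation for the difference
    between two conversion rates, equal group sizes.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)       # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # sum of per-user variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


for lift in (0.05, 0.15, 0.40):
    n = users_per_group(0.05, lift)
    print(f"+{lift:.0%} relative lift: ~{n:,} users per group")
```

Under these assumptions, the required sample shrinks roughly with the square of the lift: a +40% lift needs on the order of a couple of thousand users per group, while a +5% lift needs closer to a hundred thousand.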

Google vs a Startup: Who has more statistical power?

Google routinely runs search tests on a massive scale, for example on 100,000,000 users, where a +0.1% effect on topline metrics is a big win. Meanwhile, a typical startup with just 10,000 users may be hunting for a +15% win. Which experiment has more statistical power? You may think that with 10,000 times fewer users, the startup has no chance. But the Z-score, which measures statistical significance, scales with the square root of sample size: 10,000 times fewer users shrinks the Z-score by only a factor of 100. Meanwhile, the startup is looking for a 150x bigger effect, so for a metric with comparable variance its Z-score ends up 1.5x as large, or 50% higher. Contrary to popular belief, startups typically have a better chance of running a successful A/B test.

Z-scores determine whether experimental results are statistically significant. https://en.wikipedia.org/wiki/Standard_score
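The arithmetic behind the Google versus startup comparison can be checked in a few lines. This sketch assumes both experiments measure a metric with the same variance, so the Z-score is proportional to effect size times the square root of the number of users; `relative_z` is a hypothetical helper that returns that proportional quantity, not an actual Z-score.

```python
import math


def relative_z(effect: float, users: int) -> float:
    """Quantity proportional to the Z-score when metric variance is equal."""
    return effect * math.sqrt(users)


google = relative_z(0.001, 100_000_000)  # +0.1% effect, 100M users
startup = relative_z(0.15, 10_000)       # +15% effect, 10k users

ratio = startup / google
print(f"Startup Z-score is about {ratio:.2f}x Google's")
```

The 150x larger effect outweighs the 100x penalty from the smaller sample, which is where the 50% advantage comes from.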

Big Effects are Seldom Obvious

Small companies chasing big changes may be tempted to take a shortcut and skip A/B testing to save time. After all, low-hanging fruit seems obvious, and we should be able to accurately measure top-line effects on our dashboards, right? This is a mistake for several reasons:

  1. Changes are unpredictable in direction and magnitude. A/B testing has a way of humbling product builders and producing surprises. At Microsoft, roughly one-third of all experiments had negative results. Product improvements on early-stage products tend to be high risk and high reward, and it’s critical that you have a rigorous way to measure and evaluate each of them rather than have them killed by a highly paid executive. One of our customers with fewer than 100 daily active users had ignored straightforward improvements, thinking they were just +10% wins, but when tested they were revealed to be 300% wins +/- 100%.
  2. Ecosystem effects are complex. While topline impacts might be anticipated by the experiment’s hypothesis, secondary and ecosystem effects are far less obvious. One of our clients with ~1k DAU launched a new user badging experiment, a classic strategy for mobile apps. While this did indeed increase discoverability of a couple of new features, what blew their minds was the dozens of unexpected side effects showing +50% engagement wins on indirectly related features. The company immediately learned a lot about their users and came up with dozens of additional ideas.
  3. Dashboards are inferior causality measurement tools. We’ve all stared at dashboard movements, attempting to “read the tea leaves” and identify what caused what. This is universal, and while you may think launching a big feature will show up immediately on a dashboard, in practice the root cause of any movement is far less certain. Was that boost due to our marketing campaign? A competitor’s announcement? Is seasonality masking something? Product builders are excellent at coming up with plausible explanations for why metrics may have moved, but would be fools to bet their house on them.

A/B testing is the gold standard when it comes to measuring causality and bringing evidence and numbers to the front. While it’s true that it takes time to collect and measure the results, knowing (and not hoping for) the actual impact of each change is critical to making your product successful.

Implementing A/B testing early in a company has a way of anchoring and establishing a data-driven culture. People focus less on debates, and more on evaluating and interpreting data. Egos become marginalized, and ideas become free, not squashed.

A/B Testing is More Accessible Than Ever

While A/B testing used to require highly specialized expertise and dedicated teams, it is far more accessible now than it was a decade ago. Feature gating/flagging/toggling is now a standard tool in modern software development, allowing you to control rollouts of features/tests. There are blogs and books devoted to A/B testing, and online communities of analytics-minded folks. While it’s nearly impossible to cover what’s out there, the following two resources stand out:

There are even open-source projects to take a look at (Intuit’s wasabi, Indeed’s proctor, and sixpack). If this is daunting, there are many companies offering A/B testing services which require only a few lines of code.
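To give a sense of how little code the core of a feature gate needs, here is a minimal sketch of deterministic user bucketing. It is an illustration only, not the API of Statsig or of any of the projects named above; the function name `assign_variant`, the experiment name, and the choice of SHA-256 hashing are all assumptions of this example.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically place a user in 'treatment' or 'control'.

    Hashing the experiment name together with the user ID gives each user
    a stable bucket within an experiment, and independent buckets across
    experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical usage: gate a new signup flow for half of users.
if assign_variant("user-123", "new_signup_flow") == "treatment":
    print("show new signup flow")
else:
    print("show existing signup flow")
```

Because assignment depends only on the user ID and experiment name, a user sees the same variant on every visit without any stored state. A production system also needs exposure logging and metric analysis, which is the part the tools above take care of.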

Statsig.com

Full disclosure: I’m a data scientist building Statsig’s experimentation tool. A group of ex-Facebook employees founded Statsig to put the power of A/B testing into the hands of small and medium-sized product builders. I wrote this article to bust the myth that A/B testing is only for big companies. We’ve made A/B testing accessible, and you can be up and running A/B tests with no risk, no trial period, and no credit card. Try us out at www.statsig.com.

