A/B testing is a well-established and powerful product development tool that has become best practice among big tech companies. Yet many small and medium-sized companies aren’t running A/B tests. When asked why, they say “we aren’t Facebook/Google/Amazon, we just don’t have enough users”. Sadly, this oversimplification and misunderstanding of statistics is blocking a lot of companies from even trying the industry’s most powerful product growth tool.
Companies like Facebook, Microsoft, Airbnb, Spotify, and Netflix are the face of rapid product growth through experimentation. They run thousands of simultaneous experiments on millions of users, with microscopic wins that are worth big money (see Bing’s Shades of Blue experiment). For the rest of us, A/B testing is an amazing but unrelatable academic exercise outside the world of startups.
What’s often ignored is that while big companies use A/B testing to optimize billion-user products like Google Search and Facebook Newsfeed, they use the same tools to build zero-to-one products, starting with as little as a thousand users and growing that into millions. I spent 5 years as a Data Scientist at Facebook and oversaw A/B tests on a wide range of features and products, helping launch and grow many new products, each starting with tests on mere “thousands” of users. A/B testing is a pillar of data-driven product development irrespective of the product’s size. And despite what you may hear, small products actually have a huge statistical advantage that’s never discussed.
Startup companies have a major advantage in experimentation: huge potential and upside. Startups don’t play the micro-optimization game; they don’t care about a +0.2% increase in click-through rates. Instead, they hunt for big wins like a 40% increase in feature adoption, or a 15% increase in signup rates. Startups have a lot of low-hanging fruit and huge opportunities. The statistical term is “effect size” and it’s actually more important than sample size in determining an experiment’s statistical power.
The above chart shows experiments with equal statistical power. Your success will vary depending on your specific experiment but it’s clear you do not need millions of users to measure meaningful metric lifts as small as 5%. This doesn’t even include the multitude of tricks you can use to boost statistical power (Bayesian stats, CUPED variance reduction, or extending the duration of your test).
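To make the trade-off between effect size and sample size concrete, here is a minimal sketch of a standard sample-size approximation for a two-sided two-proportion z-test. It is not the calculation behind the chart; the 10% baseline conversion rate, the 5% significance level, the 80% power target, and the function name `users_per_group` are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def users_per_group(baseline: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group for a two-sided two-proportion z-test.

    baseline and relative_lift are illustrative inputs, e.g. a 10% conversion
    rate and a +15% relative lift (10% -> 11.5%).
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-group Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Assumed 10% baseline; compare a small, medium, and large relative lift.
sizes = {lift: users_per_group(0.10, lift) for lift in (0.05, 0.15, 0.40)}
for lift, n in sizes.items():
    print(f"+{lift:.0%} lift: about {n:,} users per group")
```

Under these assumptions the required sample drops steeply as the target lift grows, because the lift enters the formula squared. Different baselines or metrics will give different numbers.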
Google routinely runs large search tests on a massive scale, for example on 100,000,000 users, where a +0.1% effect on topline metrics is a big win. Meanwhile, a typical startup with just 10,000 users may be hunting for a +15% win. Which experiment has more statistical power? You may think that with 10,000 times fewer users, the startup has no chance. But the Z-score equation, which measures statistical significance, has a square root on sample size: 10,000 times fewer users shrinks the Z-score by only a factor of 100. Meanwhile, the startup is looking for a 150x bigger effect, so its Z-score ends up 1.5 times that of the Google test, which means more statistical power, not less! Contrary to popular belief, startups typically have a better chance of running a successful A/B test.
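The arithmetic in this comparison can be checked in a few lines. This is a rough sketch that assumes both tests measure a metric with the same baseline and variance, so the Z-score is proportional to the effect size times the square root of the sample size; the helper name `relative_z` is hypothetical.

```python
import math


def relative_z(effect: float, users: int) -> float:
    """Quantity proportional to the Z-score: effect size * sqrt(sample size).

    Assumes equal baseline and variance across the tests being compared,
    so constant factors cancel when taking a ratio.
    """
    return effect * math.sqrt(users)


google = relative_z(0.001, 100_000_000)  # +0.1% effect on 100,000,000 users
startup = relative_z(0.15, 10_000)       # +15% effect on 10,000 users

ratio = startup / google
print(f"Startup Z-score is about {ratio:.1f}x the large-scale test's Z-score")
```

The sample-size disadvantage contributes a factor of 1/100 and the effect-size advantage a factor of 150, which is where the 1.5x figure comes from.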
Small companies chasing big changes may be tempted to take a shortcut and skip A/B testing to save time. After all, low-hanging fruit seems obvious, and we should be able to accurately measure top-line effects on our dashboards, right? This is a mistake for many reasons:
Changes are unpredictable in direction and magnitude. A/B testing has an interesting effect of humbling product builders and producing unexpected surprises. At Microsoft, roughly one third of all experiments had negative results. Product improvements on early-stage products tend to be high risk and high reward, and it’s critical you have a rigorous way to measure and evaluate each of them rather than have them killed by a highly paid executive. One of our customers with fewer than 100 daily active users had ignored straightforward improvements, thinking they were just +10% wins, but when tested they were revealed to be 300% wins +/- 100%.
Ecosystem effects are complex. While topline impacts might be anticipated by the experiment’s hypothesis, what’s far less obvious is secondary and ecosystem effects. One of our clients with ~1k DAU launched a new user badging experiment, a classic strategy for mobile apps. While this did indeed increase discoverability of a couple of new features, what blew their mind was the dozens of unexpected side effects showing +50% engagement wins on indirectly related features. The company immediately learned a lot about their users and came up with dozens of additional ideas.
Dashboards are inferior causality measurement tools. We’ve all stared at dashboard movements attempting to “read the tea leaves” trying to identify what caused what. This is universal and while you may think launching a big feature will show up immediately on a dashboard, in practice, the root cause of any movement is far less certain. Was that boost due to our marketing campaign? A competitor’s announcement? Is seasonality masking something? Product builders are excellent at coming up with plausible explanations for why metrics may have moved, but would be fools to bet their house on them.
A/B testing is the gold standard when it comes to measuring causality and bringing evidence and numbers to the front. While it’s true it takes time to collect and measure the results, the value teams receive in knowing (and not hoping for) the exact impact and improvements is critical for making your product successful.
Implementing A/B testing early in a company has a way of anchoring and establishing a data-driven culture. People focus less on debates, and more on evaluating and interpreting data. Egos become marginalized, and ideas become free, not squashed.
While A/B testing used to require highly specialized expertise and dedicated teams, it is far more accessible now than it was a decade ago. Feature gating/flagging/toggling is now a standard tool in modern software development, allowing you to control rollouts of features/tests. There are blogs and books devoted to A/B testing, and online communities of analytics-minded folks. While it’s nearly impossible to cover what’s out there, the following two resources stand out:
Evan Miller’s A/B Testing Tools: https://www.evanmiller.org/ab-testing/
There are even open-source projects to take a look at (Intuit’s wasabi, Indeed’s proctor, and sixpack). If this is daunting, there are many companies offering A/B testing services which require only a few lines of code.
Full disclosure: I’m a data scientist building Statsig’s experimentation tool. A group of ex-Facebook employees founded Statsig to put the power of A/B testing into the hands of small and medium-sized product builders. I wrote this article to bust the myth that A/B testing is only for big companies. We’ve made A/B testing accessible and you can be up and running A/B tests with no risk, no trial period, and no credit card. Try us out at www.statsig.com.