As we talk to many customers, one thing is clear: many common beliefs about product experimentation are plain wrong. These beliefs are often theoretical, academic and outdated. They give teams a false sense of precision and lead them to waste months waiting on a handful of experiments.
Companies like Spotify, Airbnb, Amazon and Facebook (now Meta) have internally built up practical tribal knowledge about running experiments. They’ve carefully tweaked traditional academic recommendations to run 10x more parallel experiments. They bias for speed of learning with their product experimentation.
We built Statsig to help companies cross this experimentation gap. You ship a feature and can be looking at statistically significant results the very next day!
With great power though comes great responsibility
When you move fast, it’s possible to read results in ways that misrepresent what actually happened. I wanted to write up some of the tribal knowledge large companies have, the kind that helps you extract learnings you should act on. You too can optimize for speed of learning, while avoiding common pitfalls.
Many products see very different usage patterns over weekdays vs weekends. If Netflix only looked at experiment data from the weekend they’d bias toward weekend warriors. If they only looked at data from a weekday, they might bias toward people who don’t have weekday jobs. Using a full week (and then multiples of a week) when making decisions avoids these biases.
Looking at data for short periods of time skews results toward what your most active users tell you. For example, Laura uses YouTube every day, while Amy uses YouTube once a week. Laura’s use of YouTube is much more likely to count in experiments that have only a day’s worth of data.
Seasonality and spiky events can introduce other kinds of biases. It’s not the best idea to look at data right after running a Super Bowl ad if you’re trying to make decisions for the rest of the year!
The Power Analysis Calculator can help you determine how long you need to run an experiment to detect the impact you expect, but layer on these best practices to make sure the data represents your users well.
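To put rough numbers on the Laura and Amy example, here is a minimal sketch. It assumes a simplified model that is not from any Statsig tooling: each user is active on a fixed number of days per week, chosen uniformly at random, and the experiment window covers between one and seven days. The function name is hypothetical.

```python
from math import comb


def chance_of_being_counted(active_days_per_week: int, window_days: int) -> float:
    """Probability that a user is active at least once inside the window.

    Assumption: the user is active on `active_days_per_week` distinct days
    of each week, chosen uniformly at random, and the window covers
    `window_days` days of that week (1 to 7).
    """
    if not (0 <= active_days_per_week <= 7 and 1 <= window_days <= 7):
        raise ValueError("expects 0-7 active days and a 1-7 day window")
    days_outside_window = 7 - window_days
    # Chance that every one of the user's active days falls outside the window.
    p_missed = comb(days_outside_window, active_days_per_week) / comb(7, active_days_per_week)
    return 1 - p_missed


laura_one_day = chance_of_being_counted(7, 1)  # daily user: 1.0
amy_one_day = chance_of_being_counted(1, 1)    # weekly user: about 0.14
amy_full_week = chance_of_being_counted(1, 7)  # weekly user, full week: 1.0
```

Under this model, a one-day window is about seven times more likely to include Laura than Amy, while a full-week window includes both.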
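As an illustration of combining a power calculation with the full-week rule, the sketch below uses the textbook normal-approximation formula for comparing two conversion rates. It is not a description of how Statsig's Power Analysis Calculator works; the function names, the default 80% power, and the example traffic numbers are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(baseline_rate: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided, two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_run(users_needed_per_group: int, daily_new_users_per_group: int) -> int:
    """Round the run time up to whole weeks so every weekday is covered equally."""
    days = ceil(users_needed_per_group / daily_new_users_per_group)
    return max(1, ceil(days / 7))


# Hypothetical inputs: 10% baseline conversion, hoping to detect a 10% relative lift,
# with 1,000 new users entering each group per day.
needed = users_per_group(baseline_rate=0.10, relative_lift=0.10)
weeks = weeks_to_run(needed, daily_new_users_per_group=1000)
```

Rounding up to whole weeks means the experiment may run a few days longer than the power calculation strictly requires, which is the trade-off the weekday and weekend advice above asks for.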
When we add new features on Statsig, we often see a spurt of usage on the new feature. This sometimes comes from users who were waiting for the feature and are glad to be able to use it. It can also come from curious users who are keen to learn what the new feature does and how it works.
If we looked at Pulse results soon after starting a feature rollout, usage for the feature can be overstated. We’d conclude that features were more popular than they actually are — unless we watched for these novelty effects to wear out with time.
You can also use holdouts and backtests to measure the cumulative impact of features over a longer period of time.
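One simple way to watch for a novelty effect is to compare the measured lift in the first days of a rollout against the lift afterwards. The sketch below is a hypothetical helper, not a Statsig feature, and the daily lift values are made-up illustrative numbers.

```python
from statistics import mean


def novelty_drop(daily_lift: list[float], early_days: int = 7) -> float:
    """Average lift in the first `early_days` minus the average lift afterwards.

    A large positive value suggests early usage was inflated by novelty.
    """
    early, late = daily_lift[:early_days], daily_lift[early_days:]
    if not early or not late:
        raise ValueError("need data both inside and after the early window")
    return mean(early) - mean(late)


# Illustrative only: a 12% lift in week one that settles to 4% in week two.
example_lift = [0.12] * 7 + [0.04] * 7
drop = novelty_drop(example_lift)  # about 0.08
```

A drop close to zero suggests the early lift is holding up; a large drop is a reason to keep the experiment running before drawing conclusions.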
Statsig lets you see the holistic impact of product changes on the key metrics you care about. Using the default 95% confidence interval (error margin) means that there will be noise: roughly 1 in 20 metrics will show a statistically significant change even when there isn’t a real effect.
A few tips to make sure you’re not reacting to noise:
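The arithmetic behind the "1 in 20" figure can be sketched as follows. This assumes the metrics are independent and that no change has any real effect, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(num_metrics: int, alpha: float = 0.05) -> float:
    """Average number of metrics that look significant purely by chance."""
    return num_metrics * alpha


def chance_of_any_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of the metrics looks significant by chance."""
    return 1 - (1 - alpha) ** num_metrics


# With 20 independent metrics at the 95% level:
on_average = expected_false_positives(20)        # 1.0 metric
at_least_one = chance_of_any_false_positive(20)  # about 0.64
```

So on a scorecard of 20 metrics, seeing one or two unexplained "significant" movements is the expected outcome, not a surprise.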
1. Have a hypothesis on the change you’re making (and expected impact). If you’re rolling out a bug fix to the video player in the Facebook app and see a small reduction in new user signups, it’s unlikely your bug fix is causing that. But if you are seeing an increase in video watch time, then you can usually be sure your fix is working as intended. Many practitioners apply a 95% confidence interval to metrics they’re expecting impact on, and use a 99% confidence interval on other metrics.
2. It’s ok for your hypothesis to be “I expect no impact”. When making changes in non-user-facing code (e.g. switching an underlying subsystem), you’re looking to validate that you’re not impacting any user-facing engagement. Similarly, it’s also possible for your hypothesis to be “I’m expecting a drop”. For example, you may decide that you’re going to ship a privacy-enhancing feature that reduces engagement, but want to quantify this tradeoff before shipping.
3. Look for corroboration, and don’t just cherry-pick what you want to see. Many metrics move together and help paint a story. For example, if app crashes and sessions/user are up, and average session duration is down, it’s very likely your bug fix is driving more app crashes. If only sessions/user has increased, it’s unlikely your bug fix is causing this. The probability of two independent metrics both showing statistically significant results by chance is much lower (about 0.25% at the 95% level, versus 5% for a single metric), so corroborated movement is more likely to be signal than noise.
4. If you see impact that is material but cannot be explained, proceed with caution! Do not count on wins you’re surprised by, until you understand them.
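The practice from the first tip, a 95% bar for metrics you expected to move and a stricter 99% bar for everything else, can be sketched like this. The metric names and p-values are made-up inputs for illustration, and the helper is hypothetical, not part of any Statsig API.

```python
EXPECTED_ALPHA = 0.05  # 95% confidence for metrics named in the hypothesis
OTHER_ALPHA = 0.01     # 99% confidence for all other metrics


def metrics_to_act_on(p_values: dict[str, float], expected: set[str]) -> list[str]:
    """Return the metrics whose p-value clears the bar that applies to them."""
    flagged = []
    for metric, p_value in p_values.items():
        alpha = EXPECTED_ALPHA if metric in expected else OTHER_ALPHA
        if p_value < alpha:
            flagged.append(metric)
    return flagged


# Illustrative p-values for a video player bug fix.
observed = {
    "video_watch_time": 0.03,   # expected to move: clears the 95% bar
    "new_user_signups": 0.04,   # not expected: does not clear the 99% bar
    "app_crashes": 0.004,       # not expected, but clears the 99% bar
}
flagged = metrics_to_act_on(observed, expected={"video_watch_time"})
# flagged == ["video_watch_time", "app_crashes"]
```

Here the signup movement is treated as likely noise, while the unexpected crash movement clears the stricter bar and deserves an explanation before shipping.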
We’ve seen a game developer swap out a software library and see an increase in gaming sessions/user when they were expecting no change. If they’d celebrated this as a success without understanding why, they’d have missed the actual cause: the new library was causing the game to crash occasionally. Users restarted the game (increasing sessions), but total time spent in the game had come down. We’ve also seen examples where unexpected data has given us new user insights. In general, when trying to understand unexpected data, your toolbox should include confirming that results are reproducible (resalt and rerun) and a hypothesis-driven investigation (enumerate ideas, then look at data to confirm or disprove them).
Have a favorite insider tip for interpreting experiment results? I’d love to hear from you!
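To make "resalt and rerun" concrete: experiment platforms commonly assign users to groups by hashing the user ID together with a per-experiment salt, so changing the salt reshuffles who lands in which group. The sketch below shows that general idea with SHA-256. It is an assumption for illustration, not Statsig's actual assignment logic, and the salt strings are hypothetical.

```python
import hashlib


def assign_group(user_id: str, salt: str) -> str:
    """Deterministically place a user in 'test' or 'control' for a given salt."""
    digest = hashlib.sha256(f"{salt}.{user_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 == 0 else "control"


users = [f"user_{i}" for i in range(1000)]
first_run = {u: assign_group(u, "library_swap_v1") for u in users}
rerun = {u: assign_group(u, "library_swap_v2") for u in users}  # new salt

# Share of users who switched groups after resalting (expected to be near one half).
reshuffled = sum(first_run[u] != rerun[u] for u in users) / len(users)
```

If a surprising result shows up again with a fresh split of users, it is less likely to be an accident of which users happened to land in each group.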