As we talk to customers, one thing is clear: many common beliefs about product experimentation are plain wrong. They are often theoretical, academic, and outdated. They give teams a false sense of precision and lead them to waste months waiting on a handful of experiments.
Companies like Spotify, Airbnb, Amazon, and Facebook (now Meta) have internally built practical, tribal knowledge about running experiments. They've carefully tweaked traditional academic recommendations to run 10x more experiments in parallel, biasing for speed of learning in their product experimentation.
We built Statsig to help companies cross this experimentation gap. You ship a feature and can be looking at statistically significant results the very next day!
With great power, though, comes great responsibility
When you move fast, it's possible to read results in ways that misrepresent what actually happened. I wanted to write up some of the tribal knowledge large companies use to extract learnings worth acting on. You too can optimize for speed of learning while avoiding common pitfalls.
Many products see very different usage patterns on weekdays vs. weekends. If Netflix only looked at experiment data from the weekend, they'd bias toward weekend warriors. If they only looked at data from a weekday, they might bias toward people who don't have weekday jobs. Using a full week (and then multiples of a week) when making decisions avoids these biases.
Looking at data for short periods of time skews results toward your most active users. E.g., Laura uses YouTube every day, while Amy uses YouTube once a week. Laura's usage is much more likely to be counted in an experiment that has only a day's worth of data.
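You can see this skew concretely with a tiny simulation. The numbers here are invented purely for illustration: 500 Laura-like daily users and 500 Amy-like weekly users, and a count of how many of each land inside an analysis window.

```python
import random

random.seed(0)

# Hypothetical population: 500 daily users (like Laura) and 500 weekly
# users (like Amy), each weekly user visiting on one random day of the week.
def users_seen(window_days):
    seen_daily = 500  # daily users show up in any window
    seen_weekly = sum(random.randrange(7) < window_days for _ in range(500))
    return seen_daily, seen_weekly

print(users_seen(1))  # roughly (500, 71): a 1-day window is ~87% daily users
print(users_seen(7))  # exactly (500, 500): a full week captures both groups
```

A one-day window over-represents daily users nearly seven to one, while a full week gives both groups equal chance to appear.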
Seasonality and spiky events can introduce other kinds of biases. It's not the best idea to look at data right after running a Super Bowl ad if you're trying to make decisions for the rest of the year!
The Power Analysis Calculator can help you determine how long you need to run an experiment to detect the impact you expect, but layer on these best practices to make sure the data represents your users well.
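As a sketch of what such a calculator does under the hood, here's the textbook two-sample z-test sample-size formula (this is a generic illustration, not Statsig's actual implementation; all names and numbers are assumptions), with the duration rounded up to whole weeks per the advice above:

```python
import math
from statistics import NormalDist

def sample_size_per_group(mde, sd, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample z-test.

    mde: minimum detectable effect (absolute difference in metric means)
    sd:  standard deviation of the metric
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 at 95% confidence
    z_beta = z.inv_cdf(power)           # ~0.84 at 80% power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / mde) ** 2)

def experiment_days(n_per_group, daily_users_per_group):
    """Days needed to reach the sample size, rounded UP to whole weeks."""
    days = math.ceil(n_per_group / daily_users_per_group)
    return math.ceil(days / 7) * 7

n = sample_size_per_group(mde=0.05, sd=1.0)
print(n, experiment_days(n, daily_users_per_group=500))  # 6280 14
```

With 500 users per group per day, the raw answer is 13 days, but rounding up to two full weeks avoids the weekday/weekend bias described earlier.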
When we add new features to Statsig, we often see a spurt of usage of the new feature. This sometimes comes from users who were waiting for the feature and are glad to finally use it. It can also come from curious users who are keen to learn what the new feature does and how it works.
If we looked at Pulse results soon after starting a feature rollout, usage of the feature could be overstated. We'd conclude that features were more popular than they actually are, unless we watched for these novelty effects to wear off over time.
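One simple way to watch for novelty effects is to compare week-over-week lift rather than reading a single early number. The daily lift figures below are invented purely for illustration:

```python
# Invented daily lift (%) for a hypothetical new feature over three weeks
daily_lift = [30, 24, 18, 14, 11, 9, 8,   # week 1: novelty spike
              7, 6, 6, 5, 5, 5, 5,        # week 2: settling down
              5, 5, 4, 5, 5, 4, 5]        # week 3: steady state

def weekly_avg(week_index):
    week = daily_lift[7 * week_index : 7 * (week_index + 1)]
    return round(sum(week) / len(week), 1)

print([weekly_avg(i) for i in range(3)])  # [16.3, 5.6, 4.7]
# A week-1 readout (16.3%) would badly overstate the steady-state effect (~5%)
```

When the weekly averages stop falling, the novelty effect has worn off and the remaining lift is the one worth shipping decisions on.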
You can also use holdouts and backtests to measure the cumulative impact of features over a longer period of time.
Statsig lets you see the holistic impact of product changes on the key metrics you care about. Using the default 95% confidence interval (5% error margin) means there will be noise: roughly 1 in 20 metrics will show statistically significant changes even when there isn't a real effect.
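You can see that 1-in-20 rate directly by simulating A/A tests, where both groups get the identical experience, so every "significant" result is pure noise. This is a minimal sketch using a known-variance z-test on synthetic data:

```python
import random
from statistics import NormalDist

random.seed(7)

def aa_test_pvalue(n=500):
    """p-value for an A/A test: both groups drawn from the same N(0, 1)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = (2 / n) ** 0.5  # standard error of the difference (variance known = 1)
    z_score = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z_score)))

# 1,000 "metrics" with no real effect, read at 95% confidence
flagged = sum(aa_test_pvalue() < 0.05 for _ in range(1000))
print(flagged)  # roughly 50, i.e. about 1 in 20 flags on pure noise
```

Running regular A/A tests against your own data is also a good sanity check that your experimentation pipeline's false positive rate matches its stated confidence level.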
A few tips to make sure you're not reacting to noise:
1. Have a hypothesis on the change you're making (and its expected impact). If you're rolling out a bug fix to the video player in the Facebook app and see a small reduction in new user signups, it's unlikely your bug fix is causing that. But if you see an increase in video watch time, you can usually be confident your fix is working as intended. Many practitioners apply a 95% confidence interval to metrics they expect to impact, and a 99% confidence interval to other metrics.
2. It's ok for your hypothesis to be "I expect no impact". When making changes in non-user-facing code (e.g. switching an underlying subsystem), you're looking to validate that you're not impacting user-facing engagement. Similarly, your hypothesis can be "I'm expecting a drop". E.g., you may decide to ship a privacy-enhancing feature that reduces engagement, but want to quantify the tradeoff before shipping.
3. Look for corroboration — don't just cherry-pick what you want to see. Many metrics move together and help paint a story. E.g., if app crashes and sessions/user are up and average session duration is down, it's very likely your bug fix is driving more app crashes. If only sessions/user has increased, it's unlikely your bug fix is the cause. The probability of two independent metrics both showing statistically significant results by chance is far lower than the 5% false positive rate of any single metric, so corroborated movement is much more likely to be signal than noise.
4. If you see impact that is material but cannot be explained, proceed with caution! Don't count on wins that surprise you until you understand them.
We've seen a game developer swap out a software library and see an increase in gaming sessions/user when they were expecting no change. If they'd celebrated this as a success without understanding why, they'd have missed the actual cause: the new library was causing the game to crash occasionally. Users restarted the game (increasing sessions), but total time spent in the game had gone down. We've also seen examples where unexpected data gave us new user insights. In general, when trying to understand unexpected data, your toolbox should include confirming that results are reproducible (resalt and rerun) and a hypothesis-driven investigation (enumerate ideas, then look at data to confirm or disprove them).
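The arithmetic behind the corroboration tip above is worth spelling out. Assuming (for simplicity) fully independent metrics, which real metrics rarely are, the chance of two simultaneous false positives is tiny, while the chance of at least one false positive on a large scorecard is high:

```python
alpha = 0.05  # per-metric false positive rate at 95% confidence

# One metric flagging by chance: 1 in 20. Two independent metrics both
# flagging by chance: 1 in 400 — corroboration is strong evidence.
print(round(alpha ** 2, 4))  # 0.0025

# Flip side: on a 20-metric scorecard, at least one chance flag is likely.
print(round(1 - (1 - alpha) ** 20, 2))  # 0.64
```

This is why a single stray stat-sig metric on a big scorecard deserves skepticism, while two or three related metrics moving coherently usually tell a true story.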
Have a favorite insider tip with interpreting experiment results? I’d love to hear from you!