Reading Experimentation Tea Leaves

Vineeth Madhusudanan
Wed Mar 09 2022

Making experiment readouts trustworthy

As we talk to customers, one thing is clear: many common beliefs about product experimentation are plain wrong. They’re often theoretical, academic, and outdated. They give teams a false sense of precision and leave them waiting months on a handful of experiments.

Companies like Spotify, Airbnb, Amazon, and Facebook (now Meta) have built up practical, internal tribal knowledge about running experiments. They’ve carefully tweaked traditional academic recommendations so they can run 10x more parallel experiments, and they bias for speed of learning in their product experimentation.

The experimentation gap is real (article)

Peeking behind the curtain

We built Statsig to help companies cross this experimentation gap. You ship a feature and can be looking at statistically significant results the very next day!

With great power, though, comes great responsibility

When you move fast, it’s possible to read results in ways that misrepresent what actually happened. I wanted to write up some of the tribal knowledge large companies have built, the kind that helps you extract learnings you should act on. You too can optimize for speed of learning while avoiding common pitfalls.

Yay — it’s the weekend!

Many products see very different usage patterns on weekdays vs. weekends. If Netflix only looked at experiment data from the weekend, they’d bias toward weekend warriors. If they only looked at data from a weekday, they might bias toward people who don’t have weekday jobs. Using a full week (and then multiples of a week) when making decisions avoids these biases.

It’s easy to identify weekends on most usage charts. It’s often easy to tell the type of app too (e.g., games peak on the weekend, whereas productivity apps peak midweek).

Looking at data over short periods skews results toward what your most active users tell you. E.g., Laura uses YouTube every day, while Amy uses YouTube once a week. Laura’s usage is much more likely to be counted in an experiment that has only a day’s worth of data.
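To make this concrete, here’s a tiny simulation with made-up numbers (not real YouTube or Statsig data), assuming half the user base visits daily and half visits about once a week. With one day of data the daily visitors dominate; a full week evens out the mix.

```python
# Illustrative only: how much a 1-day window over-represents daily users
# versus a 7-day window. Population sizes and visit rates are assumptions.
import random

random.seed(0)
users = [("daily", 1.0)] * 5_000 + [("weekly", 1 / 7)] * 5_000  # (type, daily visit probability)

def exposed_share_daily(days):
    # A user is "exposed" if they visit at least once during the window
    exposed = [kind for kind, p_visit in users
               if any(random.random() < p_visit for _ in range(days))]
    return sum(kind == "daily" for kind in exposed) / len(exposed)

print(f"Daily users' share of exposures after 1 day:  {exposed_share_daily(1):.0%}")  # ~87%
print(f"Daily users' share of exposures after 7 days: {exposed_share_daily(7):.0%}")  # ~60%
```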

Seasonality and spiky events can introduce other kinds of bias too: it’s not the best idea to look at data right after running a Super Bowl ad if you’re trying to make decisions for the rest of the year!

The Power Analysis Calculator can help you determine how long you need to run an experiment to detect the impact you expect, but layer on these best practices to make sure the data represents your users well.
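If you want to sanity-check a duration yourself, here’s a minimal sketch that uses the standard two-proportion sample-size approximation and then rounds the runtime up to whole weeks so weekday/weekend cycles are fully covered. This is not Statsig’s Power Analysis Calculator, and the baseline rate, lift, and traffic figures are made-up assumptions.

```python
# A rough, standalone estimate of experiment duration (not Statsig's calculator).
import math
from scipy.stats import norm

def required_days(baseline_rate, relative_lift, daily_eligible_users,
                  alpha=0.05, power=0.8):
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided 95% confidence by default
    z_power = norm.ppf(power)
    # Classic two-proportion sample-size approximation, per group
    n_per_group = ((z_alpha + z_power) ** 2
                   * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    days = math.ceil(2 * n_per_group / daily_eligible_users)
    return math.ceil(days / 7) * 7     # round up to full weeks

# Assumed numbers: 5% baseline conversion, hoping to detect a 3% relative lift,
# 20,000 eligible users/day split evenly between test and control.
print(required_days(0.05, 0.03, 20_000))  # -> 35 days (5 full weeks)
```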

What’s this shiny new thing?

When we add new features to Statsig, we often see a spurt of usage of the new feature. This sometimes comes from users who were waiting for the feature and are glad to finally use it. It can also come from curious users who are keen to learn what the new feature does and how it works.

If we looked at Pulse results soon after starting a feature rollout, usage of the feature could be overstated. We’d conclude that features were more popular than they actually are, unless we watched for these novelty effects to wear off over time.

You can also use holdouts and backtests to measure the cumulative impact of features over a longer period of time.

Sticking a cute cat picture on a button will lift the button’s CTR in the short term, but the lift is unlikely to sustain. Statsig’s Pulse results let you look at experiment impact by days since a user’s first exposure to a feature, to help identify novelty effects.
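Here’s a hypothetical sketch of that kind of view, just to illustrate the idea: bucket each row of event data by how long the user has been exposed, then compare test vs. control per bucket. The DataFrame, column names, and numbers below are assumptions, not Pulse’s actual schema or output.

```python
# Illustrative only: spot novelty effects by slicing lift by days since first exposure.
import pandas as pd

events = pd.DataFrame({
    "user_id":        [1, 1, 2, 2, 3, 3],
    "group":          ["test", "test", "test", "test", "control", "control"],
    "first_exposure": pd.to_datetime(["2022-03-01"] * 6),
    "event_date":     pd.to_datetime(["2022-03-01", "2022-03-08",
                                      "2022-03-01", "2022-03-08",
                                      "2022-03-01", "2022-03-08"]),
    "clicks":         [9, 2, 7, 3, 4, 4],
})

events["days_since_exposure"] = (events["event_date"] - events["first_exposure"]).dt.days
by_day = (events.groupby(["days_since_exposure", "group"])["clicks"]
                .mean()
                .unstack("group"))
by_day["lift"] = by_day["test"] / by_day["control"] - 1

# A big lift on day 0 that shrinks toward zero on later days is a novelty
# effect, not a durable win.
print(by_day)
```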

But does it make sense?

Statsig lets you see the holistic impact of product changes on the key metrics you care about. Using the default 95% confidence interval (a 5% significance level) means there will be noise: roughly 1 in 20 metrics will show a statistically significant change even when there isn’t a real effect.

XKCD captures this problem well in comic form. Read the companion article, “Democratizing Experimentation,” to learn how to instill trustworthy experimentation practices in your teams.
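You can see the 1-in-20 effect in a quick simulation: score 20 metrics that have no real difference between test and control at a 95% confidence level, and count how many still look significant. The distributions and sample sizes below are arbitrary assumptions.

```python
# Illustrative only: false positives among metrics with no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_metrics, n_users = 20, 10_000

false_positives = 0
for _ in range(n_metrics):
    # Test and control draw from the same distribution: any "effect" is pure noise.
    control = rng.normal(loc=10.0, scale=3.0, size=n_users)
    test = rng.normal(loc=10.0, scale=3.0, size=n_users)
    _, p_value = stats.ttest_ind(test, control)
    false_positives += p_value < 0.05

print(f"{false_positives} of {n_metrics} null metrics look stat-sig at 95% confidence")
# On average about 1 of the 20 will cross the bar purely by chance.
```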

A few tips to make sure you’re not reacting to noise:

1. Have a hypothesis about the change you’re making (and its expected impact). If you’re rolling out a bug fix to the video player in the Facebook app and see a small reduction in new user signups, it’s unlikely your bug fix is causing that. But if you’re seeing an increase in video watch time, you can usually be confident your fix is working as intended. Many practitioners apply a 95% confidence interval to metrics they expect to impact, and a 99% confidence interval to other metrics.

2. It’s OK for your hypothesis to be “I expect no impact.” When making changes in non-user-facing code (e.g., switching an underlying subsystem), you’re looking to validate that you’re not impacting any user-facing engagement. Similarly, your hypothesis can be “I’m expecting a drop.” E.g., you may decide to ship a privacy-enhancing feature that reduces engagement, but want to quantify this tradeoff before shipping.

3. Look for corroboration; don’t just cherry-pick what you want to see. Many metrics move together and help paint a story. E.g., if app crashes and sessions/user are up, and average session duration is down, it’s very likely your bug fix is driving more app crashes. If only sessions/user has increased, it’s unlikely your bug fix is causing this. The probability that two independent metrics both show statistically significant movement by chance is much lower than the 5% false positive rate of a single metric, so corroborated movement is far more likely to be signal than noise (see the quick math after this list).

4. If you see impact that is material but cannot be explained, proceed with caution! Don’t count on wins you’re surprised by until you understand them.
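To see why corroboration helps, here’s the back-of-the-envelope math behind tip 3, assuming the two metrics are roughly independent:

```python
# At 95% confidence, each metric with no real effect has ~5% odds of a false positive.
alpha = 0.05
print(f"One null metric looking stat-sig:           {alpha:.2%}")       # 5.00%
print(f"Two independent null metrics both stat-sig: {alpha ** 2:.2%}")  # 0.25%
```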

“Any figure that looks interesting or different is usually wrong” (Twyman’s Law)

We’ve seen a game developer swap out a software library and see an increase in gaming sessions/user when they were expecting no change. If they’d celebrated this as a success without understanding why, they’d have missed the actual cause: the new library was occasionally crashing the game. Users restarted the game (increasing sessions), but total time spent in the game had come down. We’ve also seen examples where unexpected data has given us new user insights. In general, when trying to understand unexpected data, your toolbox should include confirming results are reproducible (resalt and rerun) and a hypothesis-driven investigation (enumerate ideas, then look at data to confirm or disprove them).

Talk to us

Have a favorite insider tip for interpreting experiment results? I’d love to hear from you!


