7 Ways Experiments Break

Anu Sharma
Wed Dec 08 2021
DATA-SCIENCE EXPERIENCE-DESIGN TESTING STATISTICS SCIENCE

Common mistakes to avoid when you’re getting started with experimentation

Getting started with experimentation is a bit like getting started with authentication. It doesn't seem difficult, and you have a sense that you'll figure it out. But just as with authentication, mistakes with experiments can be costly. Here is a short list of common mistakes to watch out for with experimentation.

1. Setting assignments in the wrong place

A common mistake in designing experiments is setting assignments too early. Ideally, you want to set the assignment at the point where you render the experience. Setting it earlier dilutes your sample with users who never see the treatment, which pushes results toward neutral or inconclusive. For example, say 10 visitors land on your product page and only 1 of those 10 continues from the product page to the pricing page. If you're running an experiment on the pricing page, make the assignment when the visitor reaches the pricing page, not when they land on the product page; otherwise 9 of every 10 assigned visitors contribute noise but no signal.
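
As a sketch of what "assign at the point of render" can look like in practice (the function and experiment names here are hypothetical, not a specific SDK's API), a simple hash-based assignment is both deterministic and cheap to call exactly where the experience renders:

```python
import hashlib

VARIANTS = ("control", "treatment")

def assign(unit_id: str, experiment: str) -> str:
    """Deterministically bucket a unit into a variant.

    Hash-based assignment is stable: the same visitor always lands
    in the same bucket for a given experiment, so it's safe to call
    at the moment of render rather than earlier in the funnel.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

# Assign in the pricing-page handler, not when the visitor first
# lands on the product page:
def render_pricing_page(visitor_id: str) -> str:
    variant = assign(visitor_id, "pricing_layout_v2")
    return f"pricing page, {variant} layout"
```

Because assignment happens inside the pricing-page handler, only visitors who actually see the experiment enter the analysis.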

2. Missing instrumentation

Have you ever heard yourself say, 'I'm sure this action is tracked', and then realized after starting the experiment that it's actually not? Me too! This means I have to go back, add the instrumentation, and restart the experiment. There go 2 of the 14 days I'd allocated to this experiment. As you test your instrumentation, check for missing events and missing data within events (especially unit identifiers).
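
A quick pre-flight check can catch both failure modes before the experiment starts. This is a minimal sketch; the event shape and field names are assumptions, not any particular analytics schema:

```python
# Fields every event must carry; "user_id" is the unit identifier.
REQUIRED_FIELDS = {"event_name", "user_id", "timestamp"}

def check_instrumentation(events, expected_event_names):
    """Return (event names never seen, events with missing fields)."""
    seen_names = {e.get("event_name") for e in events}
    missing_events = set(expected_event_names) - seen_names
    incomplete = [e for e in events if REQUIRED_FIELDS - e.keys()]
    return missing_events, incomplete

# Example: one event is missing its unit identifier, and
# 'pricing_page_view' was never fired at all.
events = [
    {"event_name": "product_page_view", "user_id": "u1", "timestamp": 1638921600},
    {"event_name": "add_to_cart", "timestamp": 1638921660},  # no user_id!
]
missing, incomplete = check_instrumentation(
    events, {"product_page_view", "pricing_page_view"}
)
```

Running a check like this against a staging session before launch is far cheaper than restarting a 14-day experiment.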

3. Under-powering the experiment

Statistical power is the probability of detecting a true effect when one exists. When an experiment has low power, a true effect is hard to find, and a statistically significant result is more likely to be a false positive than a true effect. To ensure the experiment is sufficiently powered, have the patience to let the experiment run and achieve the required sample size!
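
To get a feel for the sample size a well-powered experiment needs, here's a back-of-the-envelope calculation for a two-sided two-proportion z-test (the baseline rate and lift are made-up numbers; a real plan should use your own metrics or your platform's power calculator):

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size to detect a relative lift in a
    conversion rate with a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 5% relative lift on a 10% baseline conversion rate
# takes roughly 58,000 users per group at 80% power.
n = sample_size_per_group(baseline=0.10, relative_lift=0.05)
```

Numbers like these are why "let it run to the required sample size" takes patience: a seemingly modest lift can demand tens of thousands of users per group.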

4. Being trapped in the present

Low-powered experiments can also overestimate the strength of the effect (assuming it's a true effect). As Pinterest discovered, this can lead to engagement bias, where engaged users show up first and dominate the experiment results. "If you trust the short-term results without accounting for and trying to mitigate this bias, you risk being trapped in the present: building a product for the users you've already activated instead of the users you want to activate in the future." To avoid getting trapped by engagement bias, try different experiments for users in different stages.

5. Limiting your scope

Folks getting started with experimentation often associate experiments primarily with the 'growth' function that focuses on signing up new users. Broadening your scope to connect more users to the core value of your app can open up a lot more surface area for experimentation. For example, Netflix found that it has 90 seconds before viewers abandon the service, making personalization experiments incredibly valuable to their engagement and retention metrics. [Question: Do you know when new users experience moments of joy in your app?]

6. Experimenting without a plan

A lot of website optimization work focuses on changing button colors and rearranging the deck chairs on the Titanic. Throwing stuff at the wall to see what sticks isn't a plan. Tie your experiments to your product strategy. For example, if you know latency is important to your product engagement, but don't know to what extent, test your hypothesis and let the data define your product strategy. Facebook, for instance, learned that engagement increases significantly with faster message delivery. Using this data, they rebuilt the Messenger app from the ground up with Project Lightspeed, making it start twice as fast by focusing on core features and stripping away the rest.

7. Focusing on small ideas

While not as bad as experimenting without a plan, a related trap is focusing on small tweaks that lead to small results. Detecting small improvements also requires a larger sample size to reach statistical significance. Focus on the intersection of low-hanging fruit and high impact. As more people in the organization realize that the cost of an incremental experiment is approaching zero¹, they'll naturally want to turn every feature into an A/B test like Uber and Booking.com.
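
The sample-size penalty for chasing small effects is quadratic: required sample size grows roughly with the inverse square of the effect size, so halving the lift you're hunting for roughly quadruples the traffic you need. A quick illustration, using the same back-of-the-envelope two-proportion z-test formula and a made-up 10% baseline:

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance / (p2 - p1) ** 2)

# Each halving of the target lift roughly quadruples the traffic needed:
for lift in (0.20, 0.10, 0.05):
    print(f"{lift:.0%} relative lift -> "
          f"{sample_size_per_group(0.10, lift):,} users per group")
```

Big swings aren't just more valuable when they win; they're also cheaper to measure.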

Summary of Mitigations

To recap, there are two types of mitigations for mistakes in experimentation.

  1. Tactically, you want to assign users at the right point, instrument each user action, and let the experiment run.
  2. Strategically, you want to run experiments for different cohorts, identify the moments of joy in your app, tie experiments to a strategic objective, find opportunities for higher impact, and encourage rapid, ubiquitous experimentation.

If you can see the journey from having no plan -> having a strategy -> testing every feature, you're already way ahead of most people 👀

Join the Statsig Slack channel to discuss and get help with your experiments. Whether you’re running on Statsig or not, we want to see your experiments succeed!

Bonus

How can it be a real list without a few bonus features, you ask?

Ok, here are moar mistakes…

  1. Having tunnel vision: In the early days of experimentation, you might hear folks on your team say, ‘you’re only looking at data’ or ‘you’re only looking at one set of metrics’. Experimentation isn’t just about getting data to make a decision, it’s about forming the full picture. Use experiment results with other quantitative and qualitative data to form the picture for your business.
  2. Missing the learning: Whether an experiment yields statsig results or not, it's fertile ground for generating new hypotheses and insights. Roblox, for example, wanted to determine the causal impact of their Avatar Shop on community engagement and found their missing piece in an experiment they'd run a year ago!
  3. Burning out a non-measurable resource: You can avoid unintentionally burning out your valuable resources by using guardrail metrics. Say you've discovered a new channel for push notifications, and it's showing great results in improving engagement. However, you'll burn out the channel if you overuse it with excessive notifications. You might set a guardrail threshold requiring >8% CTR on push notifications before ramping up the channel. If your experiment is missing guardrail metrics, ask yourself: What trade-off am I missing? How can I model that trade-off as a guardrail metric?
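
A guardrail like the push-notification example can be as simple as a pre-ramp check. This is an illustrative sketch, with the 8% CTR threshold taken from the example above and the function name made up:

```python
def guardrail_passes(clicks: int, sends: int, min_ctr: float = 0.08) -> bool:
    """Gate a ramp-up on notification click-through rate.

    Returns True only when observed CTR clears the guardrail
    threshold; zero sends means no evidence, so we hold.
    """
    if sends == 0:
        return False
    return clicks / sends >= min_ctr

# Ramp the new channel only while the guardrail holds:
if guardrail_passes(clicks=950, sends=10_000):  # 9.5% CTR, clears 8%
    pass  # increase rollout percentage
else:
    pass  # hold or roll back
```

The same pattern generalizes: pick the resource you could exhaust, express the trade-off as a metric, and refuse to ramp when the metric dips below the threshold.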

[1] With the right experimentation platform, you can run thousands of experiments without worrying about the escalating grunt work of managing data pipelines or the infrastructure cost of processing the reams of data every day. The ideal experimentation platform will also ensure that these thousands of experiments are organized to run clear of each other without impacting each other's results.


