Getting started with experimentation is a bit like getting started with authentication. It’s not difficult and you have a sense that you’ll figure it out. But just like with authentication, mistakes with experiments can be costly. Here is a short list of common mistakes to watch out for with experimentation.
A common mistake in designing experiments is setting assignments too early. Ideally, you want to set the assignment at the point where you need to render the experience. Setting the assignment too early can lead to more neutral or inconclusive experiments, because visitors who never see the change are still counted in the results and dilute any effect. For example, suppose 10 visitors land on your product page and only 1 of them goes on from the product page to the pricing page. If you’re running an experiment on the pricing page, make the assignment when the visitor reaches the pricing page, not when they land on the product page.
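As a rough illustration of assigning at the point of rendering, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names, the experiment name `pricing_page_layout`, the in-memory `exposures` list standing in for an exposure log, and the hash-based bucketing, which is one common way to get a stable assignment and not necessarily how any particular platform does it.

```python
import hashlib

# Hypothetical stand-in for an exposure log; a real system would send these
# records to an analytics pipeline.
exposures = []

def assign_variant(user_id: str, experiment: str, variants=("control", "test")) -> str:
    """Deterministically bucket a user, so repeat visits get the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def render_product_page(user_id: str) -> str:
    # No assignment here: the experiment does not change this page.
    return "product page"

def render_pricing_page(user_id: str) -> str:
    # Assign (and log the exposure) only when the experience is rendered.
    variant = assign_variant(user_id, "pricing_page_layout")
    exposures.append((user_id, "pricing_page_layout", variant))
    return f"pricing page ({variant})"

# 10 visitors land on the product page, but only 1 reaches the pricing page.
for i in range(10):
    render_product_page(f"user-{i}")
render_pricing_page("user-3")

print(len(exposures))  # only the visitor who saw the pricing page is in the experiment
```

With this shape, the 9 visitors who never reached the pricing page are never enrolled, so they cannot dilute the comparison.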
Have you ever heard yourself say, ‘I’m sure this action is tracked’, and then realized after starting the experiment that it’s actually not? Me too! This means I have to go back, add the instrumentation, and restart the experiment. There go 2 of the 14 days I’d allocated to this experiment. As you test your instrumentation, check for missing events and for missing data within events (especially unit identifiers).
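A pre-launch check along these lines can be sketched in a few lines of Python. The event shape (dictionaries with `event_name`, `user_id`, and `timestamp` keys) and the function name are assumptions for illustration; adapt them to whatever your logging actually emits.

```python
# Hypothetical required fields; user_id plays the role of the unit identifier.
REQUIRED_FIELDS = ("event_name", "user_id", "timestamp")

def find_instrumentation_gaps(events, expected_event_names):
    """Return (events that never fired, [(index, missing fields)] for incomplete events)."""
    seen = {event.get("event_name") for event in events}
    missing_events = sorted(set(expected_event_names) - seen)

    incomplete = []
    for index, event in enumerate(events):
        # Treat absent, None, and empty-string values as missing.
        missing = [f for f in REQUIRED_FIELDS if event.get(f) in (None, "")]
        if missing:
            incomplete.append((index, missing))
    return missing_events, incomplete

sample_events = [
    {"event_name": "page_view", "user_id": "u1", "timestamp": 1},
    {"event_name": "click", "user_id": "", "timestamp": 2},  # lost its unit identifier
]
missing_events, incomplete = find_instrumentation_gaps(
    sample_events, ["page_view", "click", "purchase"]
)
print(missing_events)  # events you expected but never received
print(incomplete)      # events that arrived with fields missing
```

Running a check like this against a test session before launch is cheaper than discovering the gap two days into the experiment.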
Statistical power is the probability of detecting a true effect when one exists. When an experiment has low power, a true effect is hard to detect, and a statistically significant result is more likely to be a false positive than it would be in a well-powered experiment. To ensure the experiment is sufficiently powered, have the patience to let the experiment run and achieve the required sample size!
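To get a feel for what “required sample size” means, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. The function name, the 10% baseline rate, and the default significance level (0.05, two-sided) and power (0.8) are assumptions chosen for illustration; your experimentation platform may use a different method.

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a relative lift in a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n_10_percent_lift = sample_size_per_group(0.10, 0.10)
n_5_percent_lift = sample_size_per_group(0.10, 0.05)
print(n_10_percent_lift, n_5_percent_lift)
```

Under these assumptions, halving the lift you want to detect roughly quadruples the number of users you need, which is why stopping early is so tempting and so costly.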
Low-powered experiments can also overestimate the strength of an effect (assuming it’s a true effect). As Pinterest discovered, this can lead to engagement bias, where engaged users show up first and dominate the experiment results. “If you trust the short-term results without accounting for and trying to mitigate this bias, you risk being trapped in the present: building a product for the users you’ve already activated instead of the users you want to activate in the future.” To avoid getting trapped by engagement bias, try different experiments for users in different stages.
Folks getting started with experimentation frequently associate experiments primarily with the ‘growth’ function in the company that focuses on signing up new users. Broadening your scope to connect more users to the core value of your app can open up a lot more surface area for experimentation. For example, Netflix found that it has 90 seconds before viewers abandon the service, making personalization experiments incredibly valuable to their engagement and retention metrics. [Question: Do you know when new users experience moments of joy in your app?]
A lot of website optimization work focuses on changing button colors and rearranging the deck chairs on the Titanic. Throwing stuff at the wall to see what sticks isn’t a plan. Tie your experiments to your product strategy. For example, if you know latency is important to your product engagement, but don’t know to what extent, test your hypothesis and let the data define your product strategy. Facebook, for instance, learned that engagement increases significantly with faster message delivery. Using this data, they rebuilt the Messenger app from the ground up to start twice as fast, focusing on core features and stripping away the rest with Project Lightspeed.
While not as bad as experimenting without a plan, a related trap is focusing on small tweaks that lead to small results. Testing for small improvements also tends to need a larger sample size to reach statistical significance. Focus on the intersection of low-hanging fruit and high impact. As more people in the organization realize that the cost of an incremental experiment is approaching zero¹, they’ll naturally want to turn every feature into an A/B test like Uber and Booking.com.
To recap, there are two types of mitigations for mistakes in experimentation.
Tactically, you want to assign users at the right point, instrument each user action, and let the experiment run.
Strategically, you want to run experiments for different cohorts, identify the moments of joy in your app, tie experiments to a strategic objective, find opportunities for higher impact, and encourage rapid, ubiquitous experimentation.
If you’re seeing the journey from having no plan -> having a strategy -> testing every feature, you’re already way ahead of most people 👀
Join the Statsig Slack channel to discuss and get help with your experiments. Whether you’re running on Statsig or not, we want to see your experiments succeed!
Ok, here are moar mistakes…
Having tunnel vision: In the early days of experimentation, you might hear folks on your team say, ‘you’re only looking at data’ or ‘you’re only looking at one set of metrics’. Experimentation isn’t just about getting data to make a decision, it’s about forming the full picture. Use experiment results with other quantitative and qualitative data to form the picture for your business.
Missing the learning: Whether an experiment yields statsig results or not, it’s fertile ground to generate new hypotheses and insights. In one example, Roblox wanted to determine the causal impact of their Avatar Shop on community engagement and found their missing piece in an experiment that they’d run a year earlier!
Burning out a non-measurable resource: You can avoid unintentionally burning out your valuable resources by using guardrail metrics. Say you’ve discovered a new channel for push notifications, and it is showing great results in improving engagement. However, you’ll burn out the channel with excessive notifications if you overuse it. You might set a guardrail threshold that requires push notifications to achieve at least 8% CTR before ramping up on the channel. If your experiment is missing guardrail metrics, ask yourself: What trade-off am I missing? How can I model that trade-off as a guardrail metric?
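A guardrail like this can be expressed as a simple gate in code. The sketch below is illustrative only: the function name and the `min_sends` floor (which stops a handful of early sends from passing the check by luck) are assumptions, and only the 8% CTR threshold comes from the example above.

```python
def can_ramp_up(clicks: int, sends: int, min_ctr: float = 0.08, min_sends: int = 1000) -> bool:
    """Allow ramping up the channel only if CTR meets the guardrail threshold."""
    if sends < min_sends:
        # Hypothetical floor: too little data to judge the guardrail yet.
        return False
    return clicks / sends >= min_ctr

print(can_ramp_up(clicks=90, sends=1000))  # 9% CTR meets the 8% guardrail
print(can_ramp_up(clicks=70, sends=1000))  # 7% CTR falls short, so hold the ramp
```

The point is less the arithmetic than the habit: write the trade-off down as a number before the experiment starts, so the decision to ramp isn’t made on engagement results alone.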
¹ With the right experimentation platform, you can run thousands of experiments without worrying about the escalating grunt work of managing data pipelines or the infrastructure cost of processing reams of data every day. The ideal experimentation platform will also ensure that these thousands of experiments are organized to run clear of each other without impacting each other’s results.