Top 8 common experimentation mistakes and how to fix them

Thu Jul 18 2024

Skye Scofield

Head of Business Operations, Statsig

I recently sat down with Allon Korem, CEO of Bell Statistics, and Tyler VanHaren, Software Engineer at Statsig, to discuss some of the most frequent mistakes companies make in A/B testing and experimentation. I've summarized the discussion below, outlining the 8 common experimentation mistakes and how to fix them.

1. Data integrity: Ensure that your allocation point is consistent and verify your distributions using chi-squared tests to detect sample ratio mismatches. 

Data integrity is crucial for accurate A/B testing, but it’s often mishandled. Tyler pointed out a common mistake in the setup phase, where inconsistencies in recording user experiences lead to sample ratio mismatch (SRM). This happens when the intended 50/50 test shows a 60/40 distribution due to underreporting or technical issues. 

See our blog on Sample Ratio Mismatch 
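As a concrete illustration, here is a minimal, stdlib-only sketch of an SRM check using a chi-squared goodness-of-fit test with one degree of freedom. The function name and the strict 0.001 alpha are illustrative choices, not Statsig's implementation:

```python
import math

def srm_check(control_n, treatment_n, expected_ratio=0.5, alpha=0.001):
    """Chi-squared goodness-of-fit test (df=1) for sample ratio mismatch.

    A strict alpha (0.001 here) is a common choice for SRM checks, since
    flagging a broken experiment is worth a few extra false alarms.
    """
    total = control_n + treatment_n
    observed = [control_n, treatment_n]
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For df=1, the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha

# The 60/40 split from the example above, on 100k users: unmistakable SRM
chi2, p, is_srm = srm_check(60_000, 40_000)
```

With healthy 50/50 data the p-value stays large; a 60/40 split on any meaningful sample size drives it to effectively zero, which tells you to stop and debug the assignment pipeline rather than read the results.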

2. Skepticism and Vigilance: Regularly check data integrity over different segments and time periods to identify inconsistencies early. 

Allon emphasized the importance of being skeptical about data integrity. He recounted an instance where a friend's test results seemed suspicious, showing no initial difference between groups, followed by a sudden gap. This highlights the necessity of continuously monitoring data over time. 

3. Proper Metrics: Collaborate with data science teams to ensure metrics are correctly defined and measured, focusing on meaningful business-driven KPIs. 

Choosing and accurately measuring the right metrics is essential. Tyler mentioned cases where specific user groups, like logged-out users, skew results because they aren't represented correctly in the metric definition.

4. Statistical Methods: Use t-tests for means and z-tests for proportions in most cases. Ensure your statistical tests are relevant to your hypotheses. 

Using improper statistical methods can lead to misleading results. Allon discussed the pitfalls of not performing statistical tests or using inappropriate tests like the Mann-Whitney U test for mean comparisons. 
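To make the "right test for the right hypothesis" point concrete, here is a stdlib-only sketch of a two-proportion z-test for comparing conversion rates. The function name and example numbers are illustrative; for comparing means you would reach for a t-test instead:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (proportions)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail
    return z, p_value

# 5.2% vs 4.5% conversion on 10k users per arm
z, p = two_proportion_ztest(520, 10_000, 450, 10_000)
```

The key discipline is matching the test to the hypothesis: z-tests for proportions, t-tests for means, and rank-based tests like Mann-Whitney only when your hypothesis is actually about stochastic ordering, not about means.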

5. Peeking: Use sequential testing approaches to manage peeking. Tools like Statsig provide inflated confidence intervals for early data to mitigate premature conclusions. 

Peeking at data during a test inflates the false positive rate. Tyler highlighted the human temptation to peek, driven by curiosity or early signs of performance changes.

See our blog on mitigating the impact of data peeking in double-blind experimentation
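To see why unchecked peeking is costly, here is a stdlib-only Monte Carlo sketch (function name and parameters are illustrative). It runs simulated A/A tests where nothing changed, peeks after every batch of users, and declares a "win" if any peek crosses the naive 5% significance threshold:

```python
import math
import random

def peeking_false_positive_rate(trials=1000, peeks=10, batch=100, seed=7):
    """Fraction of A/A tests declared 'significant' at ANY of several peeks,
    using a naive fixed z > 1.96 threshold at each look."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% critical value, reused at every peek
    false_positives = 0
    for _ in range(trials):
        sum_a = sum_b = sumsq_a = sumsq_b = 0.0
        n = 0
        for _ in range(peeks):
            for _ in range(batch):
                a, b = rng.gauss(0, 1), rng.gauss(0, 1)
                sum_a += a; sumsq_a += a * a
                sum_b += b; sumsq_b += b * b
            n += batch
            mean_a, mean_b = sum_a / n, sum_b / n
            var_a = sumsq_a / n - mean_a ** 2
            var_b = sumsq_b / n - mean_b ** 2
            se = math.sqrt(var_a / n + var_b / n)
            if abs(mean_a - mean_b) / se > z_crit:
                false_positives += 1
                break  # the experimenter stops and ships the "winner"
        n = 0
    return false_positives / trials
```

With ten peeks, the realized false positive rate lands far above the nominal 5%, which is exactly the problem sequential testing solves by widening early confidence intervals.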

6. Underpowered Tests: Plan tests meticulously using power analysis calculators to ensure you have sufficient data to detect the expected changes. 

Running underpowered tests is common due to insufficient sample sizes. Allon noted that improper planning often leads to tests that can't detect meaningful changes. 
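The planning step can be sketched with a standard sample-size formula for comparing two proportions. This is a stdlib-only illustration (the function name and example rates are assumptions, not a specific vendor's calculator):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute lift `mde`
    on a baseline conversion rate `p_base` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = p_base, p_base + mde
    p_bar = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / mde ** 2
    return math.ceil(n)

# Detecting a 1-point lift on a 5% baseline needs ~8k users per arm
n = sample_size_per_group(p_base=0.05, mde=0.01)
```

Running the numbers before launch, rather than after, is the whole point: if the required sample is larger than your traffic allows, either test a bigger change or accept a longer runtime.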

7. Handling Outliers: Use winsorization to cap extreme values rather than removing outliers entirely, maintaining the integrity of your data. 

Outliers can distort test results. While it's important to manage outliers to avoid false positives, Allon advised against removing them outright. 
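Winsorization can be sketched in a few lines of stdlib Python. This version caps values at sorted-index percentiles (no interpolation), which is a simplification; the function name and the 99th-percentile cutoff are illustrative:

```python
def winsorize(values, lower_pct=0.0, upper_pct=0.99):
    """Cap values at the given percentiles instead of dropping them,
    so extreme users still count but can't dominate the mean."""
    s = sorted(values)
    lo = s[int(lower_pct * (len(s) - 1))]
    hi = s[int(upper_pct * (len(s) - 1))]
    return [min(max(v, lo), hi) for v in values]

# One whale's $5,000 order is capped, not deleted
revenue = [5, 7, 6, 8, 9, 5000]
capped = winsorize(revenue)
```

The sample size and per-user attribution stay intact, unlike outright removal, which silently changes who is in the experiment.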

8. Cultural Challenges: Foster a culture that encourages upfront hypothesis formulation and continuous learning from experimentation. 

Beyond technical issues, cultural challenges can hinder effective experimentation. Tyler stressed the importance of building a culture of hypothesis-driven testing and quick, consistent execution. 

By addressing these common testing mistakes, companies can significantly improve the accuracy and reliability of their A/B tests. These steps will help you make more informed decisions and drive better business outcomes. Feel free to reach out with any questions or comments. Let's continue the conversation on how to enhance your testing strategies! 
