Hold Out Testing

Hold out testing is an experimentation technique that helps measure the long-term impact of product changes. By excluding a small subset of users from experiencing any changes, you create a control group for comparison. This method differs from traditional A/B testing, where every user in the test is assigned to either the control or the treatment of a single change.

The key difference lies in the duration and scope of the test. While A/B tests focus on immediate effects, hold out tests assess the cumulative impact of multiple changes over an extended period. This approach provides a more accurate picture of how your product evolves and affects user behavior in the long run.

Implementing hold out tests offers several benefits for ensuring the accuracy of your metrics:

  • Mitigating the risk of overestimating the impact of individual experiments

  • Identifying potential interactions between multiple experiments

  • Accounting for novelty effects and other short-term biases

  • Providing a reliable baseline for measuring overall product improvements

By comparing the metrics of your hold out group against those exposed to changes, you can gain valuable insights into the true impact of your product decisions. This knowledge empowers you to make data-driven choices that optimize for long-term success.

Setting up a hold out test

To create a hold out group, randomly select a subset of your users and exclude them from any experiments or changes. A common starting point is around 5-10% of your total user base, though the right size depends on how much traffic you have and how small an effect you need to detect. A group in this range is usually large enough to detect meaningful differences without withholding improvements from too many users.
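
As a rough guide to choosing a size, the sketch below estimates the smallest absolute change in a conversion-style metric that a given hold out share could detect. It assumes a two-sided z-test on proportions with a normal approximation; the function name, the 1,000,000-user base, and the 10% baseline rate are illustrative assumptions, not figures from this article.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(total_users, holdout_share, baseline_rate,
                              alpha=0.05, power=0.8):
    """Approximate smallest absolute difference in a rate that can be detected."""
    n_holdout = total_users * holdout_share
    n_exposed = total_users * (1 - holdout_share)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two rates near the baseline.
    std_error = math.sqrt(
        baseline_rate * (1 - baseline_rate) * (1 / n_holdout + 1 / n_exposed)
    )
    return (z_alpha + z_power) * std_error


# Hypothetical product: 1,000,000 users, 10% baseline conversion rate.
for share in (0.01, 0.05, 0.10):
    mde = minimum_detectable_effect(1_000_000, share, baseline_rate=0.10)
    print(f"hold out {share:.0%}: detectable change of about {mde:.4f}")
```

A larger hold out share detects smaller changes, at the cost of withholding improvements from more users.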

Maintaining a consistent hold out group is crucial for accurate long-term impact measurement. Use feature flags or user segmentation to ensure the same users remain in the hold out group over time. Regularly monitor the hold out group to check for any unintended changes or contamination.
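
One common way to keep membership stable is to derive it from a hash of the user ID rather than storing a list. The sketch below is a minimal illustration of that idea; the salt string, the 5% share, and the function name are hypothetical and not part of any particular feature flag product.

```python
import hashlib

HOLDOUT_SALT = "global-holdout"  # hypothetical; changing it reshuffles the group
HOLDOUT_SHARE = 0.05             # assumed hold out size (5% of users)


def in_holdout(user_id, salt=HOLDOUT_SALT, share=HOLDOUT_SHARE):
    """Return True if this user belongs to the hold out group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < share


# The same user always gets the same answer, on any server and at any time.
print(in_holdout("user-12345"))
```

Because the assignment depends only on the user ID and the salt, no lookup table has to be kept in sync across services.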

When running multiple experiments concurrently, consider using a separate, mutually exclusive hold out group for each experiment. This makes it easier to attribute a difference to one experiment rather than to an interaction between several. Alternatively, you can use a single hold out group shared across all experiments; it measures their combined effect, so be aware of potential interactions when analyzing results.

Stratified sampling is another useful technique for creating representative hold out groups. This involves dividing your user base into distinct segments based on key characteristics (e.g., demographics, behavior) and then randomly sampling from each segment proportionally. Stratified sampling helps ensure your hold out group closely mirrors your overall user population.
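
A minimal sketch of proportional stratified sampling follows. The user records, the "plan" attribute used as the stratum, and the function name are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(users, stratum_of, share, seed=0):
    """Sample the same share of users from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[stratum_of(user)].append(user)

    holdout = []
    for members in by_stratum.values():
        k = round(len(members) * share)  # proportional allocation
        holdout.extend(rng.sample(members, k))
    return holdout


# Hypothetical user base: 800 free users and 200 paid users.
users = [(f"user-{i}", "free") for i in range(800)]
users += [(f"user-{i}", "paid") for i in range(800, 1000)]

holdout = stratified_holdout(users, stratum_of=lambda u: u[1], share=0.10)
print(len(holdout))
```

With a 10% share, the hold out group keeps the same 80/20 split between free and paid users as the full population.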

As you run experiments over time, periodically refresh your hold out group to prevent bias from long-term exposure. You can do this by randomly assigning a portion of the hold out users back into the experiment population and replacing them with new users. This rotation strategy helps maintain the integrity of your hold out testing while allowing all users to eventually benefit from improvements.
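
The rotation described above could look roughly like the sketch below. It assumes the replacements come from a pool of new users who have not yet been exposed to shipped changes; the function and variable names are hypothetical.

```python
import random


def rotate_holdout(holdout, fresh_users, rotate_share, seed=0):
    """Release a share of the hold out group and backfill it with fresh users."""
    rng = random.Random(seed)
    k = min(round(len(holdout) * rotate_share), len(fresh_users))
    # Sort before sampling so the result is reproducible for a given seed.
    released = set(rng.sample(sorted(holdout), k))
    joined = set(rng.sample(sorted(fresh_users), k))
    new_holdout = (set(holdout) - released) | joined
    return new_holdout, released


holdout = {f"user-{i}" for i in range(100)}
fresh_users = {f"new-{i}" for i in range(50)}

new_holdout, released = rotate_holdout(holdout, fresh_users, rotate_share=0.2)
print(len(new_holdout), len(released))
```

The group size stays constant, and released users move into the regular experiment population.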

Measuring the impact of changes

Comparing metrics between hold out and treatment groups is crucial for accurate impact assessment. By analyzing the differences in key performance indicators (KPIs) between these groups, you can determine the effectiveness of your changes. This comparison helps you understand whether the observed improvements are due to the implemented changes or other factors.

To analyze the long-term effects of product changes, run the hold out test over an extended period. By maintaining a hold out group for several months, you can monitor how metrics evolve over time. This approach helps identify any delayed or gradual impacts that may not be immediately apparent in short-term experiments.

Identifying unexpected interactions between multiple experiments is another important aspect of measuring change impact. When running concurrent experiments, it's essential to consider how they might influence each other. Hold out testing can help isolate the effects of individual experiments and reveal any unanticipated interactions that could skew results.

To effectively measure the impact of changes using hold out testing, consider the following:

  • Define clear KPIs that align with your business objectives and product goals. These metrics should be measurable, relevant, and sensitive to the changes you're implementing.

  • Establish a representative hold out group that closely mirrors your target audience. Ensure that this group is large enough to provide statistically significant results and is not exposed to any of the changes being tested.

  • Monitor metrics over time to identify trends and patterns. Regularly compare the performance of the hold out group against the treatment groups to assess the long-term impact of your changes.

  • Use statistical analysis to determine the significance of observed differences between groups. Apply appropriate statistical tests and models to validate your findings and account for any confounding factors.

  • Iterate and refine your experiments based on the insights gained from hold out testing. Use the data to make informed decisions about which changes to implement, modify, or discard.
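
As one example of the statistical comparison mentioned in the list above, the sketch below computes the difference in means between the two groups with a confidence interval and a two-sided p-value. It uses a large-sample normal approximation with unequal variances; the metric values are simulated, and the function name is an assumption for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def compare_groups(holdout_values, exposed_values, confidence=0.95):
    """Difference in means (exposed minus hold out) with a normal-approximation CI."""
    diff = mean(exposed_values) - mean(holdout_values)
    std_error = math.sqrt(
        variance(exposed_values) / len(exposed_values)
        + variance(holdout_values) / len(holdout_values)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / std_error))
    return {
        "diff": diff,
        "ci": (diff - z * std_error, diff + z * std_error),
        "p_value": p_value,
    }


# Simulated per-user metric values (for example, sessions per week).
rng = random.Random(42)
holdout_values = [rng.gauss(10.0, 2.0) for _ in range(2000)]
exposed_values = [rng.gauss(10.5, 2.0) for _ in range(2000)]

result = compare_groups(holdout_values, exposed_values)
print(result)
```

If the confidence interval excludes zero, the difference between the hold out group and the exposed group is unlikely to be due to chance alone. For small samples, a t-distribution would be more appropriate than the normal approximation used here.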

By leveraging hold out testing and carefully measuring the impact of changes, you can make data-driven decisions that optimize your product's performance. This approach helps you avoid relying on short-term gains and ensures that your changes have a lasting, positive effect on your users' experience.

Interpreting hold out test results

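Understanding discrepancies between individual and overall experiment results is crucial. Individual experiments may show positive impacts, but the overall impact measured by the hold out test could be lower. This difference is due to factors like regression to the mean: changes are usually shipped because their measured effect looked large, and estimates selected that way tend to shrink when they are measured again.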

Monte Carlo simulations can provide more accurate impact estimates. These simulations account for the uncertainty in every result, including the non-significant ones, giving a more realistic range for the overall impact. They help avoid overestimating the combined effect of multiple experiments.
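
A minimal sketch of such a simulation is shown below. It assumes each experiment's true effect can be drawn from a normal distribution centered on its estimate, and that effects add up independently; the lift estimates and standard errors are invented numbers. It does not by itself correct for selection bias in the estimates.

```python
import random
from statistics import mean, quantiles

# (estimated lift in percentage points, standard error); hypothetical values.
experiments = [(1.2, 0.4), (0.8, 0.3), (-0.4, 0.5), (-0.3, 0.4), (0.1, 0.6)]


def simulate_total_lift(experiments, draws=10_000, seed=0):
    """Draw a plausible effect for every experiment and sum them, many times."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(estimate, std_error) for estimate, std_error in experiments)
        for _ in range(draws)
    ]


totals = simulate_total_lift(experiments)
cuts = quantiles(totals, n=40)   # cut points every 2.5%
low, high = cuts[0], cuts[-1]    # central 95% range

# Naive alternative: add up only the statistically significant results.
naive_sum = sum(est for est, se in experiments if abs(est) > 1.96 * se)

print(f"simulated total lift: {mean(totals):.2f} (95% range {low:.2f} to {high:.2f})")
print(f"sum of significant results only: {naive_sum:.2f}")
```

In this made-up example, adding only the significant results gives a larger total than the simulation, because the non-significant negative results are ignored.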

Balancing short-term gains with long-term effects is essential in decision-making. Hold out tests reveal the lasting impact of changes. They help you avoid over-optimizing for immediate metrics while neglecting potential negative long-term consequences.

When interpreting hold out test results, consider:

  • The magnitude of the discrepancy between individual and overall results

  • The uncertainty ranges of non-significant results in Monte Carlo simulations

  • The trade-offs between short-term improvements and long-term user engagement

Hold out testing helps verify that your experiments genuinely improve user experience. It helps you make informed decisions based on the big picture, not just isolated metrics. By carefully analyzing hold out test results, you can optimize your product for sustainable growth.

Best practices for hold out testing

Managing technical debt during hold out periods is crucial. Every change withheld from the hold out group means an older code path has to stay alive in production. Regularly assess the cost of maintaining these paths, and develop a plan to remove them once the test ends.

Critical updates and security changes should be handled carefully in hold out groups. Assess the necessity of including these updates in the hold out group. If essential, consider temporarily suspending the test or creating a separate hold out group for these updates.

Integrating hold out testing into your overall experimentation strategy is key. Determine the appropriate frequency and duration of hold out tests based on your product's lifecycle and experimentation velocity. Align hold out tests with your product roadmap and key metrics.

Clearly communicate the purpose and implications of hold out testing to stakeholders. Ensure that everyone understands the value of hold out testing and its role in validating experiment results. Provide regular updates on the progress and findings of hold out tests.

Continuously monitor the hold out group for any unexpected behavior or issues. Set up alerts to notify you of any significant deviations in metrics or user feedback. Be prepared to adjust or terminate the hold out test if necessary.

Leverage feature flags to efficiently manage hold out groups. Use feature flags to control the rollout of changes to specific user segments. This allows for easy inclusion or exclusion of the hold out group from specific updates.
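
The sketch below shows one way a flag check could respect the hold out group. It is a hypothetical gate written for illustration, not the API of any specific feature flag service; the salt, the 5% share, the `critical` override for security fixes, and all function names are assumptions.

```python
import hashlib

HOLDOUT_SALT = "global-holdout"  # hypothetical salt for hold out assignment
HOLDOUT_SHARE = 0.05             # assumed hold out size


def _bucket(user_id, salt):
    """Map a user to a stable number in [0, 1) for the given salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def in_holdout(user_id):
    return _bucket(user_id, HOLDOUT_SALT) < HOLDOUT_SHARE


def is_enabled(feature, user_id, rollout_share, critical=False):
    """Decide whether a user sees a feature, keeping hold out users excluded."""
    if critical:
        return True   # critical fixes reach everyone, including the hold out
    if in_holdout(user_id):
        return False  # hold out users never see regular changes
    # Use a per-feature salt so rollouts are independent of each other.
    return _bucket(user_id, f"feature:{feature}") < rollout_share


print(is_enabled("new-checkout", "user-12345", rollout_share=0.5))
```

Checking the hold out first means a single function decides exposure for every feature, which makes accidental contamination of the hold out group less likely.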

Analyze and document the results of hold out tests thoroughly. Compare the metrics and user behavior of the hold out group with the treatment groups. Use these insights to refine your experimentation process and inform future product decisions.
