Hold-out testing is a method in A/B testing that measures the long-term effects of product changes. In a hold-out test, a small group of users is not shown the changes being tested for an extended period, typically weeks or months after the main experiment ends. This allows you to ensure that the experiment doesn't have negative long-term effects or interactions with other experiments.
By comparing the metrics of the hold-out group to the group exposed to the changes, you can gain a more accurate understanding of the cumulative impact of your experiments. This is especially important when running multiple experiments simultaneously, as it helps identify any unexpected interactions between tests that could skew results.
Implementing a hold-out group is a best practice in experimentation, as it provides a safety net against unintended consequences. It's particularly valuable for high-stakes experiments or changes that could affect key metrics like revenue or user retention. By monitoring the hold-out group over time, you can roll out winning variations with evidence that they did not cause harm over the monitoring period, rather than relying on the short-term experiment result alone.
Once your experiment reaches significance, the real holdout test begins. If the "control" variant wins, you can roll it out to 100% and stop the experiment. However, if the "test" variant wins, keep it at 90% and the holdout at 10%.
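A feature flag tool normally handles the 90/10 split for you. To make the mechanics concrete, here is a minimal sketch of one common approach, deterministic hash-based bucketing. The function name `assign_variant`, the flag key `new-checkout`, and the use of SHA-256 are illustrative assumptions, not the behavior of any specific flagging product.

```python
import hashlib

HOLDOUT_PERCENT = 10  # share of users kept on the old experience, per the 90/10 split


def assign_variant(user_id: str, flag_key: str = "new-checkout") -> str:
    """Deterministically map a user to 'holdout' or 'test' (hypothetical helper)."""
    # Hash the flag key together with the user ID so that the same user
    # can land in different buckets for different flags.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return "holdout" if bucket < HOLDOUT_PERCENT else "test"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    holdout_share = sum(assign_variant(u) == "holdout" for u in users) / len(users)
    print(f"holdout share: {holdout_share:.3f}")
```

Because the assignment depends only on the user ID and flag key, a user stays in the same group for the entire holdout period, which is what makes the later comparison meaningful.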
Continue monitoring metrics for the two flag variants using insights, dashboards, or session replay. Filter for events where the feature flag value equals either "test" or "holdout". Look at metrics like:
Pages per session and errors per session
Average session duration
Conversion rate
Real usage in session replays
After a few more weeks without issues, you can fully roll out the "test" flag. If you do see a problem, the holdout group enables quick rollback. From there, restart the improvement process.
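If you export the underlying data, the comparison between the two groups can be sketched roughly as follows. This is a minimal example assuming each session record is a dict with hypothetical `variant` and `converted` fields. It uses a standard two-proportion z-test on conversion rate, which is one reasonable choice rather than the method any particular tool applies.

```python
import math


def count_conversions(sessions):
    """Tally (conversions, sessions) per variant, ignoring other flag values."""
    counts = {"test": [0, 0], "holdout": [0, 0]}
    for session in sessions:
        variant = session.get("variant")
        if variant in counts:
            counts[variant][0] += 1 if session["converted"] else 0
            counts[variant][1] += 1
    return counts


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (rate difference, z statistic, two-sided p-value) for group A vs. B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_a - p_b, z, p_value


if __name__ == "__main__":
    # Made-up illustrative records, not real data.
    sessions = (
        [{"variant": "test", "converted": i % 8 == 0} for i in range(9000)]
        + [{"variant": "holdout", "converted": i % 10 == 0} for i in range(1000)]
    )
    counts = count_conversions(sessions)
    diff, z, p = two_proportion_z_test(*counts["test"], *counts["holdout"])
    print(f"difference={diff:.4f}, z={z:.2f}, p={p:.4f}")
```

With a holdout of only 10%, the confidence you can place in small differences is limited, so it helps to decide in advance how large a regression would trigger a rollback.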
While basic fixed-horizon frequentist A/B tests can take you far, more advanced statistical techniques can add considerable value. Once you have the basics of an experimentation platform, consider investing in:
Sequential testing and peek-proof analysis: Allows for always-valid p-values and early experiment decisions.
Assignment strategies: Supports randomization at non-user levels and nuanced strategies like new vs. existing user segmentation.
Variance reduction: Techniques like outlier capping and CUPED can dramatically reduce metric variance and increase experimental throughput.
Quasi-experiments: Estimate counterfactuals when well-randomized experiments aren't possible, using techniques like linear regression with fixed effects and difference-in-differences modeling.
Exploring tactics like multi-armed bandits, Bayesian methodologies, distributional comparisons, and causal modeling can further enhance your experimentation capabilities. The key is building on a solid foundation and incrementally adding advanced techniques that deliver value.
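As one illustration of the variance reduction techniques listed above, CUPED adjusts each user's metric using a pre-experiment covariate, typically the same metric measured before the experiment started. The sketch below uses only the standard library. The function name `cuped_adjust` and the sample numbers are made up for illustration, and it assumes the covariate is unaffected by the treatment.

```python
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    theta = cov(x, y) / var(x), estimated on the data passed in. In practice
    it is usually estimated on control and treatment pooled together.
    """
    x_bar = mean(pre_metric)
    y_bar = mean(metric)
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


if __name__ == "__main__":
    pre = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # pre-experiment values (illustrative)
    post = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]   # in-experiment values (illustrative)
    adjusted = cuped_adjust(post, pre)
    print("variance before:", variance(post))
    print("variance after: ", variance(adjusted))
```

The adjustment leaves the mean of the metric unchanged and removes the part of the variance explained by the covariate, so the gain depends on how strongly pre-experiment and in-experiment values are correlated.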
Once the main experiment reaches significance, take action based on the winning variant. If the control wins, roll it out to 100% and stop the experiment. If the test variant wins, roll out the test to 90% and keep the remaining 10% as the hold-out.
Continue monitoring key metrics for the test and hold-out groups after reaching significance. Look for issues like increased errors or degraded usage in the test group. Session replays can provide valuable insights into real user behavior.
Comparing the hold-out group to users exposed to changes provides a more accurate measurement of cumulative impact. The hold-out acts as a consistent baseline, while the exposed group reflects the aggregate effect of all changes over time.
This comparison removes biases like the "winner's curse" that can inflate impact when measuring experiments individually. By analyzing metrics for the two groups, you gain a clearer understanding of how your experimentation program has impacted the business overall.
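A small simulation can make the winner's curse concrete. The sketch below assumes true effects and measurement noise are both normally distributed, and that a change ships only when its estimate clears a one-sided significance threshold. All parameters are arbitrary illustrative choices, not measurements from a real program.

```python
import random


def simulate_winners_curse(n_experiments=2000, noise_sd=1.0, seed=0):
    """Compare summed estimated lift vs. summed true lift across shipped experiments."""
    rng = random.Random(seed)
    threshold = 1.96 * noise_sd  # ship only if the estimate looks significant
    shipped = 0
    estimated_total = 0.0
    true_total = 0.0
    for _ in range(n_experiments):
        true_effect = rng.gauss(0.0, 1.0)                   # unknown in real life
        estimate = true_effect + rng.gauss(0.0, noise_sd)   # what the experiment reports
        if estimate > threshold:
            shipped += 1
            estimated_total += estimate
            true_total += true_effect
    return shipped, estimated_total, true_total


if __name__ == "__main__":
    shipped, estimated_total, true_total = simulate_winners_curse()
    print(f"shipped experiments:      {shipped}")
    print(f"sum of estimated effects: {estimated_total:.1f}")
    print(f"sum of true effects:      {true_total:.1f}")
```

Selecting on the estimate favors experiments whose noise happened to be positive, so the summed estimates overstate the summed true effects. A hold-out comparison measures the combined effect directly and does not carry this selection step.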
The two-group holdout approach involves creating a held-out status quo group and a held-out winners group. The status quo group consistently experiences the control, while the winners group sees winning treatments as they launch. This method reduces error risks by ensuring the non-holdout group's composition remains consistent, only exposing them to winning variations. It also allows for earlier signals by promptly exposing the winners group to successful treatments, minimizing the holdout duration compared to other techniques.
For customers using their own feature flagging solution, the analysis-only mode offers access to a comprehensive suite of holdout analysis tools. This includes diagnostics for holdout health, comparisons with business metrics, and detailed holdout reports. You can also link your holdouts to experiments within the platform to illustrate the impact of individual experiments on the holdout. This modular approach allows you to leverage advanced holdout analysis capabilities while maintaining your existing deployment infrastructure.
By employing these advanced holdout strategies, you can accurately measure the cumulative impact of your experimentation program. Holdouts provide a comparative lens between users who experienced the original product and those exposed to changes over time. This aggregate measurement tends to be more accurate than summing individual experiment impacts, as it accounts for biases like the "winner's curse." With a clearer understanding of how your experimentation efforts affect key business metrics, you can make data-driven decisions and optimize your program's effectiveness.