Good teams move fast. They’re trying several ideas at any given time. When they find something that works, they ship it and find the next idea to try. With some features (e.g. adding app badging or notifications), it is useful to measure whether individual wins are sustained after prolonged exposure.
With other features (e.g. showing ads), there may be no short-term effect, but you want to understand long-term effects. You do this by creating a holdout. You keep the feature away from 1% of your users and measure the difference between this group and the other 99% after several months. This helps ensure you’re building for the long term and not over-optimizing for short-term wins.
Another key use case is measuring cumulative impact across many features. If a new shopping app ships 10 features over a quarter, each showing a 2% increase in total revenue, it’s unlikely they’d see a 20% increase at the end of the quarter. There’s often statistical noise in feature-level measurement, along with some interaction and cannibalization across features. You may end up with only a 12% total win from the quarter in which you shipped 10 features. Creating a holdout across these features lets you measure the actual impact: you keep a small set of users (1–2%) who don’t get new features during this period, and compare their metrics with those of users who got everything you chose to ship.
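To make the arithmetic concrete, here is a minimal sketch of the comparison. The revenue-per-user figures are hypothetical and chosen only to mirror the 20% versus 12% example above; `relative_lift` is an illustrative helper, not part of any particular tool.

```python
def relative_lift(treated_mean: float, holdout_mean: float) -> float:
    """Relative difference of the treated group over the holdout group."""
    return treated_mean / holdout_mean - 1.0


# Ten launches, each measured at +2% in its own experiment (from the example above).
per_feature_lifts = [0.02] * 10
naive_sum = sum(per_feature_lifts)  # what you would get if wins simply added up

# Hypothetical end-of-quarter revenue per user (assumed numbers).
holdout_revenue_per_user = 50.00  # the 1-2% of users who got none of the features
treated_revenue_per_user = 56.00  # everyone else, who got all ten

cumulative = relative_lift(treated_revenue_per_user, holdout_revenue_per_user)

print(f"sum of individual wins: {naive_sum:.0%}")
print(f"holdout-measured lift:  {cumulative:.0%}")
```

The holdout number is the one to report, since it already reflects noise, interaction and cannibalization across the launches.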
At Facebook, most product teams on the core app calculate the cumulative impact of all features shipped over the last 6 months. This aligns with their goal-setting and performance review process. At the start of every period, they create a small holdout (1–5% of users). At the end of the half, they measure the impact of the features they shipped by comparing metrics against the holdout group. They release the holdout, and start a new one for the next half.
Team or product level holdouts are powerful. You can tease apart the impact of external factors (e.g. your competitor going out of business) and seasonality (atypical events including holidays, unusual news cycles or weather) from the impact driven by your feature launches. You can also measure long-term effects and quantify subtle ecosystem changes.
Holdouts also come with costs:
1. Engineering overhead. For each feature you ship with a holdout, you’re committing to support an if-then-else fork in your code. For a fast-moving team, having to support multiple code paths makes the test and debug matrix large. Shorter holdouts help make this manageable. If your legacy code path breaks and you don’t find and fix it swiftly, your holdout results become untrustworthy.
Typically, once an experiment is shipped (or a feature finishes rollout), you go back and clean up your code to remove the branching logic that checks the experiment or feature gate status. When using holdouts, you defer this cleanup until the holdout is retired. Many teams then make a focused push with a few engineers instead of asking each engineer who shipped a feature to clean up the codebase.
2. Opportunity cost. When you ship features that increase revenue or retention, a large holdout means leaving those gains on the table. You also cause dissatisfaction when someone sees a friend with a shiny new feature that they don’t have.
One of the most expensive holdouts Facebook runs is a long-term ads holdout. Yes, there is a set of people who get to use Facebook without advertisements! Facebook values this because it helps them measure the cost of ads on engagement. It also helps them isolate the impact of ad-specific bugs.
3. Monitoring. Holdouts are typically analyzed in detail only at the end of the holdout period. It’s useful to check in on them at a regular cadence (e.g. monthly) to make sure nothing unexpected is tainting the holdout. A broken control variant impacting only 1% of your users can make the holdout useless if you only detect it at the end of the holdout period. There’s little point comparing metrics between users with new features and users with a broken experience. Checking the performance of the holdout group can also spawn investigations to understand unexpected movements.
4. Users, Customer Support & Marketing. For people in the holdout, it is confusing to see friends get a spiffy new feature when they don’t. It’s important to retire and create new holdouts every cycle so the same set of users isn’t penalized again and again.
Customer support needs to easily diagnose users who complain about a missing feature that’s publicly available. Marketing needs to be careful about splashy announcements of a new feature when the holdout is large enough that it is likely to cause negative sentiment.
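Two of the costs above, the if-then-else fork and the need to rotate who is held out each cycle, can be sketched together. This is a minimal illustration under assumptions, not how any particular platform implements it: `in_holdout`, `render_feed`, the bucket count and the cycle labels are all hypothetical names and values.

```python
import hashlib

HOLDOUT_PERCENT = 1.0  # hypothetical holdout size: 1% of users


def in_holdout(user_id: str, cycle_salt: str, percent: float = HOLDOUT_PERCENT) -> bool:
    """Deterministically decide whether a user is held out for a given cycle.

    Changing cycle_salt (e.g. once per half) reshuffles which users are held out.
    """
    digest = hashlib.sha256(f"{cycle_salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 10,000 buckets, so 0.01% granularity
    return bucket < percent * 100


def render_feed(user_id: str, cycle_salt: str) -> str:
    # This is the fork that must keep working until the holdout is retired.
    if in_holdout(user_id, cycle_salt):
        return "legacy_feed"  # old code path, still needs tests and monitoring
    return "new_feed"


# Rotating the salt between cycles gives a mostly different set of held-out users.
users = [f"user-{i}" for i in range(100_000)]
first_half = {u for u in users if in_holdout(u, "2024-H1")}
second_half = {u for u in users if in_holdout(u, "2024-H2")}
print(len(first_half), len(second_half), len(first_half & second_half))
```

Because assignment is a pure function of the user ID and the cycle salt, a support tool could call the same check to confirm that a user complaining about a missing feature is in the holdout.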
The previous section outlined some key costs. Holdouts are not cheap. To make sure you get value from your holdout, here are some tips:
1. Have a clear set of questions the holdout is designed to answer. This will guide your design, the holdout duration and the value you get, and will dictate what costs make sense to incur.
When Facebook shipped game streaming, they shipped a test that invited people to join the streamer’s community when watching a streamer’s video. Four weeks in, the topline results were neutral. More people joined communities, but the business metrics hadn’t moved.
The team was convinced that this was the right thing to do for users and shipped the feature with a small holdout. Four months later, the holdout helped measure a double-digit increase in topline metrics from this feature.
Building communities takes time. If you have conviction in your feature, launching and using holdouts to validate intuition lets you move fast while confirming progress over time.
2. Use a power analysis calculator to size the holdout. Holdouts measure over a longer period of time, so make them as small as is reasonable. Keep in mind that the final readout of the holdout will likely aggregate the metrics over the last 1–4 weeks rather than the entire holdout period. This captures the final impact of all the shipped features.
When Instagram started adding advertising, they started tentatively. Because they had a small ad load, they sized a large holdout so it was sensitive to small effects. When the ad business grew, they realized that the holdout was oversized and far too expensive relative to its value. They ended up shrinking it dramatically. This was a non-trivial task (it took months to validate and launch, with hacky changes across multiple codebases) and quite risky, as it could have ended with the main holdout the company used to validate the impact of ads becoming compromised. It’s a good reminder to think through the long-term impact of maintaining a holdout, from engineering cost to actual business cost, and factor that into your decisions on sizing, scoping and whether you should create one at all.
3. Understand the costs associated with holdouts. Make sure teams that will pay those costs understand holdout goals and buy in.
4. Getting a holdout wrong is very expensive. You write the bad holdout off and have to wait a quarter or a half for your next try. Optimize for simple and reliable over sophisticated and complex. If you’re just getting started with holdouts, re-read this bullet.
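For the sizing tip above, here is a rough sketch of what a power calculation for an unevenly split holdout might look like. It assumes a two-sided z-test on a difference in means, the same metric standard deviation in both groups, and independent users; the user count and standard deviation are hypothetical, and `minimum_detectable_effect` is an illustrative helper rather than any specific calculator.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(
    total_users: int,
    holdout_fraction: float,
    metric_std: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> float:
    """Smallest absolute difference in means detectable at the given power."""
    n_holdout = total_users * holdout_fraction
    n_treated = total_users - n_holdout
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    standard_error = metric_std * sqrt(1 / n_holdout + 1 / n_treated)
    return (z_alpha + z_power) * standard_error


# Hypothetical inputs: 1M users, per-user metric standard deviation of 20.
for fraction in (0.01, 0.02, 0.05):
    mde = minimum_detectable_effect(1_000_000, fraction, metric_std=20.0)
    print(f"holdout {fraction:.0%}: minimum detectable effect is about {mde:.3f}")
```

Most of the sensitivity comes from the smaller group, so growing the holdout buys precision at the direct expense of the opportunity cost discussed earlier. Use the standard deviation of the metric over the readout window you plan to use (e.g. the last 1–4 weeks), not over the whole holdout period.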
Holdouts are not the right tool in every situation:
1. Infra changes and bug fixes. These tend to be poor candidates for holdouts. The cost of supporting new and old infra can outweigh the benefits of doing this. Holding back bug fixes knowingly gives users broken experiences.
2. Cross-user features. If your feature requires that others also have the feature for it to work, you’re breaking the feature if you have a holdout. For example, if you ship collaborative editing in a business productivity app, you’re better off holding out some organizations from this feature instead of keeping a small percentage of users in every organization from it and breaking the feature for them and their teammates.
3. No org commitment. See the section on costs. Holdouts require commitment, and if the questions your holdout is designed to answer aren’t a priority for your business, you’re better off skipping this.
4. Backtests. For some features, backtests are a more efficient way to measure impact. A backtest is effectively an after-the-fact holdout. You take back a feature from a small set of users and then compare their metrics to everyone else to quantify the impact.
Backtests make sense when you’re happy with the result of a feature but want to make sure it reproduces (or make sure some negative guardrail impact doesn’t reproduce). With these, you’re not as worried about short-term versus long-term effects.
This works best for infra changes that aren’t user visible, where users won’t see a feature disappear on them.
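For the cross-user case in the list above, one way to hold out whole organizations rather than individual users is to hash the organization ID instead of the user ID. This is a sketch under assumptions: the `org_id` field, the salt string and the 2% size are hypothetical.

```python
import hashlib


def org_in_holdout(org_id: str, salt: str, percent: float) -> bool:
    """Hold out entire organizations so teammates always share one experience."""
    digest = hashlib.sha256(f"{salt}:{org_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10_000 < percent * 100


def can_use_collab_editing(user: dict, salt: str = "collab-holdout", percent: float = 2.0) -> bool:
    # The decision depends only on the user's organization, never on the user.
    return not org_in_holdout(user["org_id"], salt, percent)


alice = {"user_id": "alice", "org_id": "org-123"}
bob = {"user_id": "bob", "org_id": "org-123"}
print(can_use_collab_editing(alice) == can_use_collab_editing(bob))
```

When the unit of assignment is the organization, the analysis should treat the organization as the unit too, since users within one organization are not independent.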
In summary: move fast, be inventive, run many experiments. If you get one of many wrong, you’ve lost only a few weeks of data collection. With holdouts, be measured. A bad holdout can cost you months of data collection before you realize it. Start simple to optimize for success. After you’ve found success with simple holdouts, evolve them to support more ambitious goals.