Many companies want to 10x their experimentation velocity. Here are 5 techniques from sophisticated experimenters that help you do this—
Feature Rollouts: auto-measure new feature impact with an a/b test
Parameters: remove experiment variants in code to iterate faster
Layers: remove hardcoded experiment references from code
CUPED: use statistical techniques to get results faster
Holdouts: measure cumulative impact/progress without grunt work
Most tooling is painful and error-prone. This makes teams spend countless hours and sweat on experimentation—limiting what gets tested. Companies that do this… understand the value of experimenting, but get a fraction of the value they should.
Many of the largest and most successful tech companies have figured out how to run experiments at an industrial scale. They make it easy for individual teams to measure the impact of each new feature on the company or organization's KPIs. This superpower brings data into the decision-making process, preventing endless debates and meetings.
Modern products ship new features behind a feature gate so they can control who sees features.
When there’s a partial rollout to a set of equivalent users, that is enough for Statsig to turn that into an A/B test. In this example, Statsig compares metrics for users Passing (10% rollout, Test) with those failing (90% not yet rolled out — Control).
The image below shows an example of a Pulse Report that shows a lift in metrics between Control and Test.
Using Statsig feature gates to rollout new features removes the cognitive load of turning every rollout into an experiment—while still giving you observability into the impact the rollout has.
The legacy way to implement experiments is to have a bunch of if-then-else blocks in your code to handle each variant.
A more agile way to implement an experiment is to simply retrieve the button color from the experiment in Statsig.
You can now restart the experiment with a new set of colors to test, without touching shipped code. You can even increase the number of variants—test three colors instead of two—just by changing the config in Statsig.
When you’re working with mobile apps, the difference between the two approaches is night and day. You can rerun experiments even on older app versions without waiting for new code with a new if-then-else statement to hit the app stores. No more waiting for users to upgrade to the latest version of the app before you start to collect data!
The best in-house, next-gen experimentation systems use similar approaches. Read how Uber does something similar to unlock agility with their experimentation (Architecture section)
Experiment Parameters help you move faster. When you want to move even faster, hard-coded Experiment names become a bottleneck. What if you could ship another experiment without updating your code?
Layers enable this. Layers are typically used to run mutually exclusive experiments. They are also used to remove direct references to experiment names in code.
In the example below, elements on the app’s home screen are set up as parameters on the “Home Screen” layer—button_color, button_text and button_icon. The app simply retrieves parameters from this layer, without any awareness of experiments on the home screen.
If there are no experiments active in the layer, the default layer parameters apply. In the example below, there are three experiments active — with users split between them (mutual isolation). These experiments can control all or a subset of the layer parameters.
You can complete old experiments and start new experiments without touching the client app at all.
Controlled-experiment Using Pre-Experiment Data is a technique to reduce variance and bias in results. Think of it as noise-reduction - we look at noise in metrics before the experiment started to reduce noise in results.
Looking across hundreds of customers—it reduces the sample sizes and durations for over half the key metrics measured in experiments. Learn more about our CUPED implementation. There are other statistical techniques including winsorization (limiting outlier values) that are also applied, but they typically don’t have as big an impact.
Team or product-level holdouts are powerful tools to measure the cumulative impact of features and experiments you’ve shipped over a holdout period (often ~6 months). You can tease apart the impact of external factors (e.g. your competitor going out of business) and seasonality (atypical events including holidays, unusual news cycles or weather) from the impact driven by your feature launches. You can also measure long-term effects and quantify subtle ecosystem changes.
Mature product teams use long-term holdouts. These can be expensive for engineers to set up—everyone creating a feature or an experiment needs to be aware of and respect this holdout.
On Statsig — creating a global holdout automatically applies them to new features gates and experiments. People creating them don’t have to do any manual work to check the Holdout.
Learn more about how the feature works and best practices around managing holdouts.
This isn’t an exhaustive list. e.g. 6. Want to run hundreds of multi-armed bandits where you trust the system to pick a winner based on an optimization function? There’s Autotune. e.g. 7. Want to look at key metrics in near real-time? There’s Event Explorer. 8. Want to spin a quick new metric the same day, for a new feature you’re building? We’ve got you. 9. Reuse the data team-approved canonical metrics for your company from your warehouse? You can do that. 10. Want feature teams to self serve slicing data by OS, Country, Free vs Paid or another dimension you choose so they’re not blocked behind a data team crafting manual queries? Yes.
There are many more of these…
We created Statsig to close the experimentation gap between sophisticated experimenters and others. Feel free to reach out to talk about other ideas that accelerate experimentation!
Thanks to Tore
Thanks to our support team, our customers can feel like Statsig is a part of their org and not just a software vendor. We want our customers to know that we're here for them.
Migrating experimentation platforms is a chance to cleanse tech debt, streamline workflows, define ownership, promote democratization of testing, educate teams, and more.
Calculating the right sample size means balancing the level of precision desired, the anticipated effect size, the statistical power of the experiment, and more.
The term 'recency bias' has been all over the statistics and data analysis world, stealthily skewing our interpretation of patterns and trends.
A lot has changed in the past year. New hires, new products, and a new office (or two!) GB Lee tells the tale alongside pictures and illustrations:
A deep dive into CUPED: Why it was invented, how it works, and how to use CUPED to run experiments faster and with less bias.