Many companies want to 10x their experimentation velocity. Here are 5 techniques from sophisticated experimenters that help you do this:
1. Feature Rollouts: auto-measure new feature impact with an A/B test
2. Parameters: remove experiment variants from code to iterate faster
3. Layers: remove hardcoded experiment references from code
4. CUPED: use statistical techniques to get results faster
5. Holdouts: measure cumulative impact/progress without grunt work
Most tooling is painful and error-prone. Teams end up spending countless hours on the mechanics of experimentation, which limits what gets tested. Companies in this position understand the value of experimenting, but get only a fraction of the value they should.
Many of the largest and most successful tech companies have figured out how to run experiments at an industrial scale. They make it easy for individual teams to measure the impact of each new feature on the company or organization's KPIs. This superpower brings data into the decision-making process, preventing endless debates and meetings.
Modern products ship new features behind a feature gate so they can control who sees features.
When there’s a partial rollout to a set of equivalent users, that is enough for Statsig to turn the rollout into an A/B test. In this example, Statsig compares metrics for users passing the gate (10% rollout, Test) with those failing it (90% not yet rolled out, Control).
The image below shows an example of a Pulse Report that shows a lift in metrics between Control and Test.
Using Statsig feature gates to roll out new features removes the cognitive load of turning every rollout into an experiment, while still giving you observability into the impact of the rollout.
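To make the idea concrete, here is a minimal sketch of how a partial rollout can double as an A/B test. It is not Statsig's implementation: the hashing scheme, the function names (`rollout_bucket`, `passes_gate`, `mean_by_group`) and the gate salt are all assumptions made for illustration. The point is only that a stable, random-looking split of users gives you a Test group and a Control group for free.

```python
import hashlib


def rollout_bucket(user_id: str, gate_salt: str, buckets: int = 10_000) -> int:
    """Map a user to a stable bucket in [0, buckets) by hashing salt + user ID."""
    digest = hashlib.sha256(f"{gate_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def passes_gate(user_id: str, gate_salt: str, rollout_pct: float,
                buckets: int = 10_000) -> bool:
    """True if the user's bucket falls inside the rolled-out percentage."""
    return rollout_bucket(user_id, gate_salt, buckets) < rollout_pct * buckets / 100


def mean_by_group(metric_by_user: dict, gate_salt: str, rollout_pct: float) -> dict:
    """Average a per-user metric for users passing (Test) and failing (Control)."""
    totals = {"Test": 0.0, "Control": 0.0}
    counts = {"Test": 0, "Control": 0}
    for user_id, value in metric_by_user.items():
        group = "Test" if passes_gate(user_id, gate_salt, rollout_pct) else "Control"
        totals[group] += value
        counts[group] += 1
    # Skip groups with no users to avoid dividing by zero.
    return {g: totals[g] / counts[g] for g in totals if counts[g] > 0}
```

Because the bucket depends only on the salt and the user ID, a user keeps the same assignment across sessions, and widening the rollout from 10% to 20% only moves users from Control to Test, never the other way.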
The legacy way to implement experiments is to have a bunch of if-then-else blocks in your code to handle each variant.
A more agile way to implement an experiment is to simply retrieve the button color from the experiment in Statsig.
You can now restart the experiment with a new set of colors to test, without touching shipped code. You can even increase the number of variants—test three colors instead of two—just by changing the config in Statsig.
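The contrast between the two approaches can be sketched as follows. `ExperimentAssignment` is a hypothetical in-memory stand-in for whatever the experimentation service returns for a user, not the Statsig SDK, and the variant names and colors are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExperimentAssignment:
    """Hypothetical stand-in: the parameter values a user's variant resolves to."""
    params: dict = field(default_factory=dict)

    def get(self, name: str, default: Any) -> Any:
        # Fall back to the shipped default when the parameter is not set.
        return self.params.get(name, default)


def button_color_legacy(variant: str) -> str:
    """Legacy style: one branch per variant, so a new variant needs a new release."""
    if variant == "test_red":
        return "red"
    elif variant == "test_green":
        return "green"
    else:
        return "blue"  # control, or any variant this build does not know about


def button_color_from_params(assignment: ExperimentAssignment) -> str:
    """Parameter style: the code only knows the parameter name and a default."""
    return assignment.get("button_color", "blue")
```

With the parameter style, a third variant that sets `button_color` to `"orange"` works on builds that shipped before the variant existed. The legacy function would silently fall through to `"blue"` for that variant.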
When you’re working with mobile apps, the difference between the two approaches is night and day. You can rerun experiments even on older app versions without waiting for new code with a new if-then-else statement to hit the app stores. No more waiting for users to upgrade to the latest version of the app before you start to collect data!
The best in-house, next-gen experimentation systems use similar approaches. Read how Uber does something similar to unlock agility in its experimentation (see the Architecture section).
Experiment Parameters help you move faster. When you want to move even faster, hard-coded Experiment names become a bottleneck. What if you could ship another experiment without updating your code?
Layers enable this. Layers are typically used to run mutually exclusive experiments. They are also used to remove direct references to experiment names in code.
In the example below, elements on the app’s home screen are set up as parameters on the “Home Screen” layer—button_color, button_text and button_icon. The app simply retrieves parameters from this layer, without any awareness of experiments on the home screen.
If there are no experiments active in the layer, the default layer parameters apply. In the example below, there are three experiments active — with users split between them (mutual isolation). These experiments can control all or a subset of the layer parameters.
You can complete old experiments and start new experiments without touching the client app at all.
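Here is a rough sketch of the layer idea, under simplifying assumptions. The `Layer` and `LayerExperiment` classes, the 1,000-slot allocation and the hashing are illustrative choices for this post, not Statsig's data model. What it shows is that the app asks the layer for parameters, and the layer decides whether an experiment overrides the defaults for that user.

```python
import hashlib
from dataclasses import dataclass, field


def _bucket(key: str, buckets: int) -> int:
    """Stable pseudo-random bucket in [0, buckets) for a string key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


@dataclass
class LayerExperiment:
    name: str
    start: int     # first layer slot owned by this experiment (inclusive)
    end: int       # last layer slot owned by this experiment (exclusive)
    groups: list   # one dict of parameter overrides per experiment group


@dataclass
class Layer:
    name: str
    defaults: dict
    experiments: list = field(default_factory=list)

    def params_for(self, user_id: str) -> dict:
        """Layer defaults, overridden by at most one experiment for this user."""
        values = dict(self.defaults)
        slot = _bucket(f"{self.name}.{user_id}", 1000)
        for exp in self.experiments:
            # Slot ranges are assumed not to overlap, which keeps
            # experiments in the layer mutually exclusive.
            if exp.start <= slot < exp.end:
                group_index = _bucket(f"{exp.name}.{user_id}", len(exp.groups))
                values.update(exp.groups[group_index])
                break
        return values
```

The client only ever calls something like `layer.params_for(user_id)["button_color"]`. Experiment names live in the layer's configuration, so adding or retiring an experiment changes `experiments` and nothing in the app.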
CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique to reduce variance and bias in results. Think of it as noise reduction: we use how each metric behaved before the experiment started to reduce noise in the results.
Across hundreds of customers, CUPED reduces the required sample sizes and durations for over half the key metrics measured in experiments. Learn more about our CUPED implementation. Other statistical techniques are also applied, including winsorization (limiting outlier values), but they typically don’t have as big an impact.
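For intuition, here is a minimal sketch of the standard CUPED adjustment using only the Python standard library. It is the textbook form of the technique, not a description of Statsig's production implementation, and the function name `cuped_adjust` is ours. Each user's metric during the experiment is shifted by how far their pre-experiment value sat from the average, scaled by theta = cov(pre, post) / var(pre).

```python
from statistics import fmean, pvariance


def cuped_adjust(pre: list, post: list) -> list:
    """Return CUPED-adjusted values: post - theta * (pre - mean(pre)).

    pre[i] is user i's metric before the experiment and post[i] is the same
    metric during it. In practice theta is usually estimated on the pooled
    Test and Control data so both groups get the same adjustment.
    """
    mean_pre = fmean(pre)
    mean_post = fmean(post)
    cov = fmean([(x - mean_pre) * (y - mean_post) for x, y in zip(pre, post)])
    var_pre = pvariance(pre, mu=mean_pre)
    # With no pre-period variation there is nothing to adjust for.
    theta = cov / var_pre if var_pre > 0 else 0.0
    return [y - theta * (x - mean_pre) for x, y in zip(pre, post)]
```

The adjustment leaves the mean of the metric unchanged and removes the part of its variance that the pre-experiment data explains. The more strongly the pre-period values correlate with the in-experiment values, the bigger the reduction, which is why it helps some metrics a lot and others barely at all.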
Team or product-level holdouts are powerful tools to measure the cumulative impact of features and experiments you’ve shipped over a holdout period (often ~6 months). You can tease apart the impact of external factors (e.g. your competitor going out of business) and seasonality (atypical events including holidays, unusual news cycles or weather) from the impact driven by your feature launches. You can also measure long-term effects and quantify subtle ecosystem changes.
Mature product teams use long-term holdouts. These can be expensive for engineers to set up—everyone creating a feature or an experiment needs to be aware of and respect this holdout.
On Statsig, creating a global holdout automatically applies it to new feature gates and experiments. People creating them don’t have to do any manual work to check the holdout.
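Conceptually, a global holdout is a check that runs before every gate or experiment evaluation. The sketch below assumes a hash-based holdout and made-up names (`in_holdout`, `check_gate`, the `global_holdout` salt); it is not how Statsig implements holdouts. It only shows why feature authors don't need to think about the holdout when the check lives in one shared place.

```python
import hashlib


def _bucket(key: str, buckets: int = 10_000) -> int:
    """Stable pseudo-random bucket in [0, buckets) for a string key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_holdout(user_id: str, holdout_salt: str, holdout_pct: float) -> bool:
    """True if the user is in the held-out slice that sees no new features."""
    return _bucket(f"{holdout_salt}.{user_id}") < holdout_pct * 10_000 / 100


def check_gate(user_id: str, gate_salt: str, rollout_pct: float,
               holdout_salt: str = "global_holdout",
               holdout_pct: float = 5.0) -> bool:
    """Evaluate a gate, always failing users who are in the global holdout."""
    if in_holdout(user_id, holdout_salt, holdout_pct):
        return False
    return _bucket(f"{gate_salt}.{user_id}") < rollout_pct * 10_000 / 100
```

Since the holdout uses its own salt, the same users stay held out across every gate for the whole holdout period. Comparing their metrics with everyone else's at the end gives the cumulative impact of what shipped.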
This isn’t an exhaustive list. For example:
6. Want to run hundreds of multi-armed bandits where you trust the system to pick a winner based on an optimization function? There’s Autotune.
7. Want to look at key metrics in near real-time? There’s Event Explorer.
8. Want to spin up a quick new metric the same day, for a new feature you’re building? We’ve got you.
9. Want to reuse the data team-approved canonical metrics for your company from your warehouse? You can do that.
10. Want feature teams to self-serve slicing data by OS, Country, Free vs Paid or another dimension you choose, so they’re not blocked behind a data team crafting manual queries? Yes.
There are many more of these…
We created Statsig to close the experimentation gap between sophisticated experimenters and others. Feel free to reach out to talk about other ideas that accelerate experimentation!
Thanks to Tore