Split testing has become an important tool for companies across many industries. There’s a huge amount of literature (and Medium posts!) dedicated to examples and explanations of why this is, and why large companies in Tech have built their cultures around designing products in a hypothesis-driven way.
Today, many teams only use this powerful tool for binary ship/no-ship decisions on product changes. It’s not quite keeping your Ferrari in the garage, but it’s pretty similar to buying a Swiss army knife for just the knife.
I want to talk a bit about how testing can function as a powerful tool for analysis and understanding, and how it can be useful for much more than launch decisions.
The Standard: Pre-Launch Experimentation
When people think about testing, they mostly think about pre-launch experimentation:
1. A user need is surfaced or hypothesized
2. A solution is proposed
3. An MVP of the solution is designed
4. The target population is split randomly for a test, where some get the solution (Test) and some don’t (Control)
5. Comparing outcomes between the users with the Test and Control experiences tells the team whether the solution worked. Based on the results, they might ship the full solution, iterate on it (go back to 3), or scrap the idea and try something new (go back to 2)
This is a powerful framework because testing removes the need to understand all of the behaviors, motivations, and people using your product. By running a well-designed test, you can get the critical information on “what this feature achieves” while letting random assignment control for confounding variables and bias.
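To make steps 4 and 5 concrete, here is a minimal sketch in Python. It is not how Statsig or any particular platform implements assignment or analysis: the function names, the hash-based bucketing, and the choice of a pooled two-proportion z-test on a conversion metric are all assumptions made for illustration.

```python
import hashlib
import math


def assign_group(user_id: str, experiment: str, test_share: float = 0.5) -> str:
    """Deterministically place a user in Test or Control (hypothetical helper).

    Hashing the experiment name together with the user id gives a stable,
    effectively random bucket, so a user always sees the same experience.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_share else "control"


def two_proportion_z_test(conv_test: int, n_test: int,
                          conv_control: int, n_control: int):
    """Compare conversion rates between Test and Control.

    Returns (absolute lift, z statistic, two-sided p-value) using the
    pooled-variance normal approximation.
    """
    p_test = conv_test / n_test
    p_control = conv_control / n_control
    pooled = (conv_test + conv_control) / (n_test + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_control))
    z = (p_test - p_control) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return p_test - p_control, z, p_value
```

For example, `two_proportion_z_test(120, 1000, 100, 1000)` compares a 12% conversion rate in Test against 10% in Control. Because assignment depends only on the user and experiment identifiers, the split is independent of user behavior, which is what lets the comparison control for confounders.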
Testing can be much more than just a tool for aiding your launch decisions. We don’t test because it tells us the “correct decision” to make. Testing is a tool for measuring the effect of a change in isolation. Launch tests are an important use of that tool, but only one use.
Let’s go through a quick list of use cases for testing outside of the scope of launch decisions:
Checking your assumptions. Often, you’ll think a feature is an obvious win for users. You might not want to delay the launch for multiple weeks to do a full-fledged pre-test. Launching with a small subset of users behind a split feature gate lets you move forward with confidence — you can launch quickly, but still check that the feature is working as intended. There’s a great example of this providing value for a Statsig client here.
Measuring the long-term efficacy of a feature. By keeping a feature from a tiny portion of your users with a Holdout or Feature Gate, you can easily track the “current value” of that feature and prioritize accordingly, instead of operating on outdated information when your users (or the world) have changed.
Measuring how much new features contributed to growth. When there’s a big movement in business metrics, it’s important to know if it’s because of what you’re doing, or because of external factors. Without a holdout, I’ve seen this turn into a painful debate with no ground truth. With a holdout, you can instantly get an estimate of how well the features you built are working.
Diagnosing regressions. If a key metric goes down unexpectedly, it can lead to an all-hands-on-deck situation as teams scramble to figure out what happened. With holdouts and feature gates, it’s simple to check whether a team’s changes or a recent launch led to a large drop. In the Statsig console, we track rollouts in metric dashboards for easy diagnosis.
Running proof-of-concept tests to inform prioritization. If you’re not sure about a potential feature, you can run a scoped-down yet significant change as an experiment to get directional signal on that feature before choosing to commit. There’s a great example from my colleague Pierre here (and another from Ritvik here!).
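As a rough illustration of the holdout use cases above, the sketch below estimates how much a set of launched features moved a per-user metric by comparing exposed users against the holdout. The function name and inputs are hypothetical, and the interval uses a simple normal approximation with unpooled (Welch-style) variances; a real platform would likely apply more careful methods.

```python
import math
from statistics import mean, variance


def holdout_lift(exposed: list[float], holdout: list[float], z: float = 1.96):
    """Estimate the effect of launched features from a holdout (hypothetical helper).

    exposed: per-user metric values for users who received the features.
    holdout: per-user metric values for users held back from them.
    Returns (difference in means, (lower, upper)) where the bounds form an
    approximate 95% confidence interval when z = 1.96.
    """
    diff = mean(exposed) - mean(holdout)
    # Standard error of the difference, using each group's sample variance.
    se = math.sqrt(variance(exposed) / len(exposed)
                   + variance(holdout) / len(holdout))
    return diff, (diff - z * se, diff + z * se)
```

If the interval sits comfortably above zero, the features are plausibly responsible for part of a metric increase. If it straddles zero, external factors are a more likely explanation for whatever movement you observed.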
I’ve seen all of the above used to great effect, and there are many more valuable use cases out there. I’ve used experiments to measure price elasticity, to deliberately regress performance in order to estimate the value of improving it, and to evaluate headroom for ranking boosts. All of these insights would be almost impossible to generate without some kind of user test.
As a Data Scientist raised in product teams, I’m usually the first person to push back on optimistic interpretations of experiment results. Practitioners will generally advise fairly strict frameworks for experimental evaluation — only look at metrics you hypothesized about, use Bonferroni corrections liberally, don’t p-hack by looking at subgroup results, etc., etc.
While it’s true we need to be careful of misusing data, harm only happens when we take action on incidental findings, or treat them as concrete evidence.
Tests should be run and evaluated on a specific hypothesis, but ignoring all other results would be a disservice to yourself and your team. Statistical rigor is needed for decision making, but using test results for learnings and light inference is exceptionally valuable. Results on sub-populations, time series of experimental lift, and movement in unexpected or secondary metrics should be treated as useful but not definitive information. They are a rare opportunity to observe causal relationships between the changes you make and users’ behavior.
False positives are a reality of testing: at a 5% significance level, expect roughly 1 in 20 metrics with no true effect to be statistically significant by chance. This requires you to don a skeptic’s hat and be cautious about drawing conclusions, but I still encourage you to look at unexpected results, develop a hypothesis of why they might be happening, and then think about how you might test, explore, or otherwise validate that new hypothesis. Some of the biggest wins I’ve seen in my career were from follow-ups to someone saying “that’s interesting…” in an experiment review.
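The arithmetic behind that caution is easy to sketch. The helpers below are illustrative names, not part of any library, and they assume the metrics are independent and that none has a true effect, which real metrics rarely satisfy exactly.

```python
def expected_false_positives(n_metrics: int, alpha: float = 0.05) -> float:
    """Expected number of metrics that look significant purely by chance."""
    return n_metrics * alpha


def prob_any_false_positive(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null metrics is significant."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric threshold that caps the family-wise error rate at alpha."""
    return alpha / n_metrics
```

With 20 metrics, `expected_false_positives(20)` is 1 and `prob_any_false_positive(20)` is about 0.64, so a surprising movement on a scorecard of that size is more likely than not to include some noise. A Bonferroni threshold of `bonferroni_alpha(20)`, or 0.0025 per metric, is the strict option for decisions; for exploration, a follow-up test of the new hypothesis is the better check.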
Testing is an intimidating space, with lots of pitfalls, competing approaches, and strong opinions being thrown around. There are many traps to fall into: peeking, sample ratio mismatches, underpowered tests, and publication bias, to name a few.
However, at the end of the day testing is based on a simple question: “all else equal, what happens if we change x?” This is fundamentally a simple — and exciting — concept. Any time you think “I wonder if…”, there’s an opportunity to develop a new hypothesis and learn more about how the world (and your products) work. Take advantage!
This topic is dear to my heart both as a Data Scientist and as part of a team that’s building an experimentation platform at Statsig. Everyone here came from companies that invested heavily in experimentation, and we’ve seen the valuable insights that experiments can drive.
For each of the use cases above, I highlighted the names of the Statsig tools I’d choose. I’d encourage you to check out the demo site if you’re interested in learning more from your experimentation!