Split testing has become an important tool for companies across many industries. There’s a huge amount of literature (and Medium posts!) dedicated to examples and explanations of why this is, and why large companies in Tech have built their cultures around designing products in a hypothesis-driven way.
Today, many teams use this powerful tool only for binary ship/no-ship decisions on product changes. It’s not quite keeping your Ferrari in the garage, but it’s pretty similar to buying a Swiss Army knife just for the knife.
I want to talk a bit about how testing can function as a powerful tool for analysis and understanding, and how it can be useful for much more than launch decisions.
The Standard: Pre-Launch Experimentation
When people think about testing, they usually picture pre-launch experimentation.
This is a powerful framework because testing removes the need to understand all of the behaviors, motivations, and people using your product. By running a well-designed test, you can get the critical information on “what this feature achieves” while letting random assignment control for confounding variables and bias.
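To make that concrete, here’s a minimal sketch in Python of what a test boils down to: randomly split users, apply the change to one group, and compare the metric between groups. All of the data below is simulated, so the numbers are purely illustrative.

```python
# A minimal sketch of the core mechanic, using simulated data
# (all numbers here are invented for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Per-user metric values; the treatment arm has a small true lift.
control = rng.normal(loc=10.0, scale=3.0, size=5_000)
treatment = rng.normal(loc=10.2, scale=3.0, size=5_000)

# Random assignment balances confounders in expectation, so a
# simple two-sample comparison isolates the effect of the change.
result = stats.ttest_ind(treatment, control)
lift = treatment.mean() / control.mean() - 1
print(f"lift: {lift:+.2%}, p-value: {result.pvalue:.4f}")
```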
Testing can be much more than just a tool for aiding your launch decisions. We don’t test because it tells us the “correct decision” to make. Testing is a tool for measuring the effect of a change in isolation. Launch tests are an important use of that tool, but only one use.
Testing has plenty of use cases outside the scope of launch decisions, and I’ve seen them used to great effect. I’ve personally used experiments to measure price elasticity (sketched below), to regress against performance and estimate the value of improving it, and to evaluate the headroom for ranking boosts. All of these insights would be almost impossible to generate without having some kind of user test set up.
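As one illustration, here’s a hedged sketch of how a pricing test can produce a price elasticity estimate; the prices and quantities are invented for the example.

```python
# A hedged sketch of estimating price elasticity from a pricing test.
# Prices and quantities are hypothetical, not real results.
control_price, treatment_price = 10.00, 11.00   # +10% price in treatment
control_qty, treatment_qty = 1_000, 930         # units sold per arm

pct_change_qty = (treatment_qty - control_qty) / control_qty
pct_change_price = (treatment_price - control_price) / control_price

# Elasticity of demand: % change in quantity per % change in price.
elasticity = pct_change_qty / pct_change_price
print(f"price elasticity ≈ {elasticity:.2f}")   # ≈ -0.70: demand is inelastic
```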
As a Data Scientist raised in product teams, I’m usually the first person to push back on optimistic interpretations of experiment results. Practitioners will generally advise fairly strict frameworks for experimental evaluation — only look at metrics you hypothesized about, use Bonferroni corrections liberally, don’t p-hack by looking at subgroup results, etc., etc.
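For instance, the Bonferroni correction those frameworks call for simply tightens the significance threshold from alpha to alpha divided by the number of metrics tested. A quick sketch with invented p-values, using statsmodels:

```python
# A sketch of the Bonferroni correction: with m metrics, require
# p < alpha / m instead of p < alpha. The p-values are invented.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.048, 0.003, 0.210, 0.044]  # five metrics, alpha = 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant: {sig}")
# Only metrics with raw p < 0.05 / 5 = 0.01 survive the correction.
```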
While it’s true we need to be careful about misusing data, harm only comes when we take action on incidental findings or treat them as concrete evidence.
Tests should be run and evaluated against a specific hypothesis, but ignoring all other results would be a disservice to yourself and your team. Statistical rigor is needed for decision making, but using test results for learnings and light inference is exceptionally valuable. Looking at results on sub-populations, tracking the time-series of experimental lift, or paying attention to unexpected movement in secondary metrics should be treated as valuable, but not definitive, information. Each is an opportunity to observe causal relationships between the changes you make and users’ behavior.
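As a sketch of what that light inference might look like in practice, here’s one way to slice experimental lift by sub-population; the data frame and column names are hypothetical, not a real schema.

```python
# A hypothetical sketch of slicing experiment results by sub-population.
import pandas as pd

df = pd.DataFrame({
    "group":    ["control", "treatment"] * 4,
    "platform": ["ios", "ios", "android", "android"] * 2,
    "metric":   [4.1, 4.3, 3.2, 3.9, 4.0, 4.4, 3.1, 3.8],
})

# Mean metric per (platform, group), then treatment lift over control.
means = df.groupby(["platform", "group"])["metric"].mean().unstack("group")
means["lift"] = means["treatment"] / means["control"] - 1
print(means)
# A surprising subgroup lift is a lead to investigate, not a conclusion:
# it wasn't part of the original hypothesis.
```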
False positives are a reality of testing: at a 5% significance level, roughly 1 in 20 metrics your change didn’t actually move will appear statistically significant by chance. This requires you to don a skeptic’s hat and be cautious about drawing conclusions, but I still encourage you to look at unexpected results, develop a hypothesis of why they might be happening, and then think about how you might test, explore, or otherwise validate that new hypothesis. Some of the biggest wins I’ve seen in my career came from follow-ups to someone saying “that’s interesting…” in an experiment review.
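You can watch that 1-in-20 arithmetic play out in a quick simulation: test 20 metrics with no true effect and count how many clear the 5% bar purely by chance.

```python
# A small simulation of the 1-in-20 intuition: test 20 metrics where
# the true effect is zero and count "significant" results at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_metrics, n_users = 0.05, 20, 2_000

false_positives = 0
for _ in range(n_metrics):
    a = rng.normal(size=n_users)  # control, no true effect
    b = rng.normal(size=n_users)  # treatment, no true effect
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"{false_positives} of {n_metrics} null metrics flagged significant")
# On average, about 1 of 20 will be flagged purely by chance.
```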
Testing is an intimidating space, with lots of pitfalls, competing approaches, and strong opinions being thrown around. There are many traps to fall into: peeking, ratio mismatches, underpowered tests, and publication bias, to name a few.
However, at the end of the day testing is based on a simple question: “all else equal, what happens if we change x?” This is fundamentally a simple — and exciting — concept. Any time you think “I wonder if…”, there’s an opportunity to develop a new hypothesis and learn more about how the world (and your products) work. Take advantage!
This topic is dear to my heart, both as a Data Scientist and as part of the team building an experimentation platform at Statsig. Everyone here came from companies that invested heavily in experimentation, and we’ve seen the valuable insights that experiments can drive.
For each of the use cases above, there’s a Statsig tool I’d reach for. I’d encourage you to check out the demo site if you’re interested in learning more from your experimentation!