There’s More To Learn From Tests

Craig
Wed Apr 20 2022

Split testing has become an important tool for companies across many industries. There’s a huge amount of literature (and Medium posts!) dedicated to examples and explanations of why this is, and why large companies in Tech have built their cultures around designing products in a hypothesis-driven way.

Today, many teams are only using this powerful tool for binary ship/no-ship decisions on product changes. It’s not quite keeping your Ferrari in the garage, but it’s pretty similar to buying a Swiss army knife for just the knife:

(I’m better with data than with art)

I want to talk a bit about how testing can function as a powerful tool for analysis and understanding, and how it can be useful for much more than launch decisions.

The Standard: Pre-Launch Experimentation

When people think about testing, they mostly think about pre-launch experimentation:

  1. A user need is surfaced or hypothesized
  2. A solution is proposed
  3. An MVP of the solution is designed
  4. The target population is split randomly for a test, where some get the solution (Test) and some don’t (Control)
  5. Comparing outcomes between the users with the Test and Control experiences tells the team whether the solution worked. Based on the results, they might ship the full solution, iterate on it (go back to 3), or scrap the idea and try something new (go back to 2)

This is a powerful framework because testing removes the need to understand all of the behaviors, motivations, and people using your product. By running a well-designed test, you can get the critical information on “what this feature achieves” while letting random assignment control for confounding variables and bias.
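To make steps 4 and 5 concrete, here's a minimal Python sketch. It is not any particular platform's implementation. It assumes users are bucketed by hashing their ID together with an experiment name, and that the outcome is a conversion rate compared with a pooled two-proportion z-test. The function names, the experiment name, and the counts are all hypothetical.

```python
import hashlib
import math


def assign_group(user_id, experiment, test_share=0.5):
    """Deterministically bucket a user into 'test' or 'control'.

    Hashing (experiment, user_id) gives a stable, effectively random split:
    the same user always lands in the same group for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_share else "control"


def two_proportion_z_test(conv_test, n_test, conv_control, n_control):
    """Return (absolute lift, z statistic, two-sided p-value)."""
    p_test = conv_test / n_test
    p_control = conv_control / n_control
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_test + conv_control) / (n_test + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_control))
    z = (p_test - p_control) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_test - p_control, z, p_value


# Hypothetical counts: 120 of 1,000 Test users converted vs. 100 of 1,000 Control users.
lift, z, p_value = two_proportion_z_test(120, 1000, 100, 1000)
print(f"lift={lift:.3f}, z={z:.2f}, p={p_value:.3f}")
```

Hashing instead of calling a random number generator on every request is a design choice: it keeps each user's experience consistent across sessions without storing assignments anywhere.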

Unrealized Value: Testing to Understand

Testing can be much more than just a tool for aiding your launch decisions. We don’t test because it tells us the “correct decision” to make. Testing is a tool for measuring the effect of a change in isolation. Launch tests are an important use of that tool, but only one use.

Let’s go through a quick list of use cases for testing outside of the scope of launch decisions:

  1. Checking your assumptions. Often, you’ll think a feature is an obvious win for users. You might not want to delay the launch for multiple weeks to do a full-fledged pre-test. Launching with a small subset of users behind a split feature gate lets you move forward with confidence — you can launch quickly, but still check that the feature is working as intended. There’s a great example of this providing value for a Statsig client here.
  2. Measuring the long-term efficacy of a feature. By keeping a feature from a tiny portion of your users with a Holdout or Feature Gate, you can easily track the “current value” of that feature and prioritize accordingly, instead of operating on outdated information when your users (or the world) have changed.
  3. Measuring how much new features contributed to growth. When there’s a big movement in business metrics, it’s important to know if it’s because of what you’re doing, or because of external factors. Without a holdout, I’ve seen this turn into a painful debate with no ground truth. With a holdout, you can instantly get an estimate of how well the features you built are working.
  4. Diagnosing regressions. If a key metric goes down unexpectedly, it can lead to an all-hands-on-deck situation as teams scramble to figure out what happened. With holdouts and feature gates, it’s simple to check whether a team’s changes or a recent launch caused a large drop. In the Statsig console, we track rollouts in metric dashboards for easy diagnosis.
  5. Running proof-of-concept tests to inform prioritization. If you’re not sure about a potential feature, you can run a scoped-down, yet significant change as an experiment to get directional signal on that feature before choosing to commit. There’s a great example from my colleague Pierre here (and another from Ritvik here!).
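The holdout idea behind use cases 2 to 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Statsig SDK: the holdout name, the 2% share, and the helper functions are all hypothetical, and the lift here is a plain ratio of means with no confidence interval.

```python
import hashlib
from statistics import mean

HOLDOUT_SHARE = 0.02  # assumed: 2% of users never receive the new features


def in_holdout(user_id, holdout_name="h1_holdout", share=HOLDOUT_SHARE):
    """Deterministically place a small, stable slice of users in the holdout."""
    digest = hashlib.sha256(f"{holdout_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 < share


def feature_enabled(user_id, launched=True):
    """Gate check: holdout users stay on the old experience even after launch."""
    return launched and not in_holdout(user_id)


def holdout_lift(metric_by_user):
    """Relative difference in a metric between exposed users and the holdout.

    `metric_by_user` maps user_id -> metric value (e.g. sessions per week).
    """
    held = [v for u, v in metric_by_user.items() if in_holdout(u)]
    exposed = [v for u, v in metric_by_user.items() if not in_holdout(u)]
    if not held or not exposed:
        raise ValueError("need at least one user in each group")
    return mean(exposed) / mean(held) - 1.0
```

Because the holdout is randomly assigned, a gap between the two groups estimates the combined effect of everything gated behind it, which is the “ground truth” that is missing when a metric moves and nobody knows why.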

I’ve seen all of the above used to great effect, and there are many more valuable use cases out there. I’ve used experiments to measure price elasticity, to deliberately regress performance in order to estimate the value of improving it, and to evaluate headroom for ranking boosts. All of these insights would be almost impossible to generate without having some kind of user test set up.

Don’t Waste Your Tests: Take Time to Think About The Results

As a Data Scientist raised in product teams, I’m usually the first person to push back on optimistic interpretations of experiment results. Practitioners will generally advise fairly strict frameworks for experimental evaluation — only look at metrics you hypothesized about, use Bonferroni corrections liberally, don’t p-hack by looking at subgroup results, etc., etc.

While it’s true we need to be careful of misusing data, harm only happens when we take action on incidental findings, or treat them as concrete evidence.

Tests should be run and evaluated on a specific hypothesis, but ignoring all other results would be a disservice to yourself and your team. Statistical rigor is needed for decision making, but using test results for learnings and light inference is exceptionally valuable. Looking at results on sub-populations, tracking time-series of experimental lift, or paying attention to unexpected/secondary metrics moving should be treated as useful but not definitive information. It’s a rare opportunity to observe causal relationships between the changes you make and users’ behavior.

False positives are a reality of testing: at a 5% significance level, expect about 1 in 20 metrics with no real effect to come out statistically significant by chance. This requires you to don a skeptic’s hat and be cautious about drawing conclusions, but I still encourage you to look at unexpected results, develop a hypothesis of why it might be happening, and then think about how you might test, explore, or otherwise validate that new hypothesis. Some of the biggest wins I’ve seen in my career were from follow-ups to someone saying “that’s interesting…” in an experiment review.
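The “1 in 20” arithmetic is worth working through once. The sketch below makes simplifying assumptions: none of the metrics has a true effect, and the metrics are independent of each other, which real product metrics rarely are. The function names are mine.

```python
def expected_false_positives(n_metrics, alpha=0.05):
    """Expected count of significant results when no metric truly moved."""
    return n_metrics * alpha


def prob_any_false_positive(n_metrics, alpha=0.05):
    """Chance that at least one of n independent null metrics looks significant."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics, alpha=0.05):
    """Per-metric threshold that keeps the overall false positive rate near alpha."""
    return alpha / n_metrics


n = 20  # hypothetical scorecard with 20 metrics
print(expected_false_positives(n))
print(round(prob_any_false_positive(n), 3))
print(round(prob_any_false_positive(n, bonferroni_alpha(n)), 3))
```

Under these assumptions, a 20-metric scorecard is more likely than not to show at least one spurious “win”. That is why a surprising secondary metric is a prompt for a follow-up hypothesis, and not yet evidence.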

Parting Thoughts

Testing is an intimidating space, with lots of pitfalls, competing approaches, and strong opinions being thrown around. There are many traps to fall into: peeking, sample ratio mismatches, underpowered tests, and publication bias, to name a few.

However, at the end of the day testing is based on a simple question: “all else equal, what happens if we change x?” This is fundamentally a simple — and exciting — concept. Any time you think “I wonder if…”, there’s an opportunity to develop a new hypothesis and learn more about how the world (and your products) work. Take advantage!

Statsig

This topic is dear to my heart both as a Data Scientist and as part of a team that’s building an experimentation platform at Statsig. Everyone here came from companies that invested heavily in experimentation, and we’ve seen the valuable insights that experiments can drive.

For each of the use cases above, I highlighted the names of the Statsig tools I’d choose. I’d encourage you to check out the demo site if you’re interested in learning more from your experimentation!

