Today we’re going to walk through some thoughts on how to approach picking metrics for an experiment. This isn’t exhaustive, but it should lay solid groundwork for setting yourself up for success with experimentation.
Let’s start with a short video, and discuss!
In the video above, we gave advice on how to approach picking the right metrics for an experiment (or conversely, how to avoid picking bad metrics!)
This will always require some thought, and understanding of the product changes you’re trying to achieve, but with practice it becomes second nature to have a strong, measurable hypothesis and a portfolio of secondary metrics to monitor.
Here’s my mental card when I get started:
Let’s walk through this together!
Your primary evaluation should be on 1–2 metrics that tie directly into a specific hypothesis. Our suggested approach is to start by breaking what you’re testing into two pieces:
This should be an immediate outcome of your change, and generally should be a necessary (but not sufficient) outcome for your experiment to succeed. We would call this a “mechanical” or “behavioral” metric. Answer the question “what is the first thing I would see if my experiment worked?”
Examples: clicks, 60-second page views, conversions.
Generally, you’re making a change in order to help drive a high-level business outcome; this metric represents that outcome and answers “why” you’re making this change. Sometimes this can’t be measured directly (“Satisfaction”) but can be measured through a proxy metric (“Retention”).
Pick an appropriate scope for this metric. A UI tweak in a menu likely won’t affect revenue enough to be observable, but you might try to measure a low-level strategic outcome like reducing “time to menu click”.
Breaking our hypothesis down this way allows us to clearly state it in terms of what we’re driving, and what we expect that to lead to.
I think by making the cart button bigger, we can drive more clicks on it. I think this might drive more revenue through more checkouts.
Your business outcome doesn’t need to follow immediately from the behavioral outcome, but you should be able to explain how you’d get from the behavioral metric to a change in the business metric. This is a good way to validate that your hypothesis makes sense in the first place.
This is also useful for framing your analysis. If you don’t see your business outcome move, you already have the first step for diagnosing this built-in:
Once you’ve done the above, congrats! You have a clear and measurable hypothesis. However, products are complex. It’s usually a mistake to see a positive result and assume the result is an unmitigated victory. You can proceed with much more confidence after thinking through what your secondary metrics should be.
Consider the ways that your change could lead to a worse user experience. How could you try to measure if that’s happening? This is how you develop good counter-metrics, which you check to make sure you’re not accidentally causing regressions.
Common examples:
Sometimes, you can help yourself by picking a business metric that addresses these counter-metrics. Let’s say you’re a subscription service and want to increase conversion, but you’re worried that you’ll increase refunds as well. Tracking “Revenue Minus Refunds at Day 7”, instead of “Gross Revenue” means your business metric will be much more robust to movements in early refunds.
In addition to your mechanical metric above, you might want to add color by filling in the mathematical steps between “point A” and “point B”.
For example, if your business metric is “Revenue”, and your mechanical metric is “Checkouts”, you’d want to also track “Average Purchase Size in USD” to connect the formula of Revenue = [Checkouts * $/Checkout]
Experiments can provide valuable information outside of the scope of your hypothesis. While you need to treat incidental observations with some skepticism due to the inherent noise in experimentation, it’s a good practice to use experiment results to form new hypotheses.
For example, let’s say you’ve heard that users in country A like dropdown menus and users in country B like type-aheads. However, you’re not sure if this is true, or causal.
If you happen to have run an experiment that switches a button to a typeahead for all users, you can split the results by country to get some causal signal on if users in those countries truly have different preferences. It’s not your main hypothesis, but it’s an important learning! Looking for these free insights is a great way to get more value out of your data.
Once you’ve done this, You should be able to state the results of this exercise in plain language:
Do
Don’t
Explore Statsig’s smart feature gates with built-in A/B tests, or create an account instantly and start optimizing your web and mobile applications. You can also schedule a live demo or chat with us to design a custom package for your business.
💡 How to decide between leaning on data vs. research when diagnosing and solving product problems Four heuristics I’ve found helpful when deciding between data vs. research to diagnose + solve a problem. Earth image credit of Moncast Drawing. As a PM, data...
Have you ever sent an email to the wrong person? Well I have. At work. From a generic support email address. To a group of our top customers. Facepalm. In March of 2018, I was working on the games team at Facebook. You may remember that month as a tumultuous...
Run experiments with more speed and accuracy We’re pleased to announce the rollout of CUPED for all our customers. Statsig will now automatically use CUPED to reduce variance and bias on experiments’ key metrics. This gives you access to a powerful experiment...
You Can’t Invent Without Experimenting When Amazon launched Home Services, the team was convinced that most people want to schedule home installations in the mornings, evenings, or weekends. This naturally constrained the number of available time slots, and...
Training your team to make independent decisions Image Courtesy: The New Yorker “It was like the debate of a group of savages as to how to extract a screw from a piece of wood. Accustomed only to nails, they had made one effort to pull out the screw by main...
Photo by Joshua Hoehne on Unsplash By now, most people realize that when they open Facebook or Instagram on their phone, their experience is very different than the person next to them. It goes deeper than just the content that you see, and the ranking...