Introducing surrogate metrics

Mon May 12 2025

You don’t need to wait a year to know if your feature worked—surrogate metrics let you move fast and stay anchored to long-term outcomes that actually matter.

Statsig now supports the use of surrogate metrics in experiments. If you bring the predicted values and mean squared error (MSE) of the model used to make these predictions, we’ll accurately account for the model’s prediction error when calculating p-values and confidence intervals.

Why are surrogate metrics helpful?

Often, the most important metrics for judging a feature’s success are hard to move and only materialize over the long term. This creates a tension: we want to move fast with our experimentation program and avoid the opportunity cost of delaying good launches, but we also want to make decisions based on these long-term, high-level metrics.

Surrogate metrics (also called proxy or predictive metrics) estimate a long-term outcome that’s impractical to measure during an experiment. They help us move fast while staying oriented toward the metrics we care most about, offering quicker feedback with greater sensitivity to change.

Statsig doesn’t create surrogate metrics for you; we rely on you to generate unit-level data from a model that you create and maintain. When you create a surrogate metric, you’ll input the mean squared error (MSE) of your model so that we can accurately account for the error associated with the prediction.
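For example, if you hold out a labeled validation set, the MSE you’d enter is just the average squared difference between your model’s predictions and the eventual outcomes. A minimal sketch in NumPy (the array values here are made up for illustration):

```python
import numpy as np

# Hypothetical validation data: actual long-term outcomes and the
# model's predictions for the same units
actual = np.array([120.0, 80.0, 150.0, 95.0, 110.0])
predicted = np.array([115.0, 88.0, 140.0, 100.0, 105.0])

# Mean squared error: the value you'd enter when creating the metric
mse = np.mean((actual - predicted) ** 2)
print(mse)  # 47.8
```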

Surrogate metric best practices

Because you create and maintain the model behind a surrogate metric, we recommend following some general best practices for predictive machine learning:

1. Validate your predictive model and monitor over time

Having a very accurate model is not enough. There are many modeling best practices that become especially important when used to create a surrogate metric for experimentation.

General Modeling Best Practices:

  • Inputs should be independent of assignment. Assignment to any given experiment group should be random and not correlated to any input to the predictive model.

  • Inputs should have no leakage from covariates observed after the prediction window. If your model accidentally uses data that would only be known after the prediction is made, the model isn’t feasible to deploy in practice, because those leaked covariates would have no value at prediction time.

  • Your model should be an unbiased estimator. The expected error when comparing your surrogate metric to the true north metric must be zero. When a model is biased in one direction, results can be particularly misleading due to artificially inflated or deflated predictions.

  • Outputs should not exhibit heteroscedasticity. The predicted value and the expected magnitude of its error term should not be correlated.
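The bias and heteroscedasticity checks above can be sketched as quick diagnostics on a held-out validation set. This is a simplified illustration with simulated data (all names and numbers are made up); in practice you’d run these checks against your real predictions and outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated held-out data: predictions S and true outcomes X,
# generated here with unbiased, homoscedastic errors
predicted = rng.uniform(50, 150, size=1000)
actual = predicted + rng.normal(0, 10, size=1000)

errors = actual - predicted

# Bias check: the mean error should be close to zero
mean_error = errors.mean()

# Heteroscedasticity check: |error| should be uncorrelated with the prediction
corr = np.corrcoef(predicted, np.abs(errors))[0, 1]

print(f"mean error: {mean_error:.3f}, corr(prediction, |error|): {corr:.3f}")
```

A mean error far from zero suggests bias; a substantial correlation between predictions and error magnitudes suggests heteroscedasticity, and either would undermine the MSE-based correction described below.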

Over time, product changes can improve or degrade the quality of a surrogate model’s predictions, so it’s important to continuously monitor the efficacy of your surrogates.

2. Measure non-surrogate metrics in experiments too

When you’re running an experiment with surrogate metrics, your suite of evaluation metrics should also include measurements of the mechanism that you expect to use to move your true north metric.

For example, let’s say your true north metric is revenue generated over the next year. You design an experiment that gives tenured users a small discount on any purchase. The hypothesis is that, even though revenue per purchase may drop, the policy could increase overall annual revenue by boosting retention and encouraging more frequent purchases.

During the experiment, you should track the surrogate metric that estimates annual revenue. But you should also measure short-term retention, purchase frequency, and revenue per purchase. These metrics help you understand the specific behaviors you're influencing, alongside the predicted change in your long-term topline revenue.

Misalignment of incentives can happen with any metric, but it can be particularly pernicious when surrogate metrics are being used, since the predictive model serves as a layer of abstraction that can obscure when an input is being inadvertently gamed. Having non-surrogate and mechanistic metrics alongside your surrogate can help clarify the real behaviors you’re observing and how they might lead to the predicted outcome in your surrogate metric.

3. Use holdouts to validate decisions made based on surrogate metrics

Holdouts are a great way to validate the results of individual experiments. When using surrogate metrics, holdouts can be particularly important as an opportunity to measure the true north metric that your surrogate predicts.

Related: Check out our article about holdouts for a refresher

Even if your main decision-making process uses fast and lightweight surrogate metrics, holdouts ensure you're not being misled by short-term or noisy signals by evaluating decisions in aggregate and/or over a longer time period.

Getting started on Statsig

On Statsig, surrogate metrics can be created as “Latest Value” type metrics in Warehouse Native projects. When setting them up, you’ll need to provide the Mean Squared Error (MSE) of your model so we can correctly account for prediction error in the results.

Experiment results for surrogate metrics will display a confidence interval that accounts for both the variance observed among units in the experiment’s variants and the error associated with the prediction.

How do surrogate metrics work on Statsig?

Since surrogate metrics are predictions, there is some inherent error associated with their accuracy. To avoid inflating the number of false positives in our results, we need to adjust our p-value and confidence interval calculations accordingly.

Consider the variable 𝑋 to be the true north metric which is being predicted by the surrogate metric 𝑆. The surrogate metric 𝑆 is assumed to be an unbiased estimator with an error term 𝜖 without heteroscedasticity (such that 𝑆 and 𝜖 are independent).

$$X = S + \epsilon$$

Since 𝑆 is an unbiased estimator we know that

$$\mathbb{E}[\epsilon] = 0$$

and

$$\mathrm{Var}(\epsilon) = \mathbb{E}[\epsilon^2] - \mathbb{E}[\epsilon]^2 = \mathbb{E}[\epsilon^2] = \mathrm{MSE}$$

The variance of 𝜖 is the same as the mean squared error (𝐌𝐒𝐄).

Thus, we can calculate the mean of 𝑋 as:

$$\mu_X = \mathbb{E}[S + \epsilon] = \bar{S} + 0 = \bar{S}$$

This means that we can use the same mean lift observed in the experiment to represent our true north metric’s mean lift.

We can use the definition of our true north variable 𝑋 and 𝐌𝐒𝐄 to calculate variance for our true north metric 𝑋:

$$\mathrm{Var}(X) = \mathrm{Var}(S + \epsilon) = \mathrm{Var}(S) + \mathrm{MSE}$$

These values calculated for the mean and variance of the true north metric 𝑋 can then be utilized in our typical process for calculating a p-value and confidence interval.
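As a worked illustration of the derivation above, with made-up summary statistics, the model’s MSE simply adds to the per-unit variance before the usual difference-in-means calculation (this sketch assumes prediction errors are independent across units and uses a standard 95% z-interval):

```python
import math

# Hypothetical experiment readout for a surrogate metric S
n_control, n_test = 10_000, 10_000
mean_control, mean_test = 100.0, 102.0
var_s_control, var_s_test = 400.0, 400.0  # Var(S) per unit in each group
mse = 225.0                               # model's mean squared error

# Var(X) = Var(S) + MSE, per the derivation above
var_x_control = var_s_control + mse
var_x_test = var_s_test + mse

# Standard error of the difference in means, using the adjusted variances
se = math.sqrt(var_x_control / n_control + var_x_test / n_test)

delta = mean_test - mean_control
ci_low, ci_high = delta - 1.96 * se, delta + 1.96 * se
print(f"lift: {delta:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

Without the MSE adjustment, the standard error would be understated and the confidence interval misleadingly narrow, which is exactly the false-positive inflation the correction is designed to prevent.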
