Correlation occurs when there is a relationship between the values of one variable and the values of another variable. This contrasts with independence, where there is no relationship between the values of the two variables.
Correlation can be described as weak or strong depending on how poorly or well one measurement predicts the value of the other. It is often measured by the Pearson Correlation Coefficient, which describes the strength of the linear relationship between two variables. For variables X and Y, it is the covariance of the two variables divided by the product of their standard deviations:
\[ \textit{corr}(X, Y) = \frac{\textit{cov}(X, Y)}{\sigma_X \sigma_Y} \]
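As a minimal sketch of the formula above, the coefficient can be computed directly from its definition using only the Python standard library. The function name `pearson_corr` and the sample spend values are hypothetical and chosen purely for illustration.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sd_x * sd_y)

# Hypothetical data: spend in the first 7 days vs. the following 6 months.
spend_7d = [0.0, 5.0, 10.0, 20.0, 40.0]
spend_6m = [10.0, 30.0, 45.0, 90.0, 160.0]
r = pearson_corr(spend_7d, spend_6m)
print(f"corr = {r:.3f}")
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. Because the coefficient only captures linear association, two variables can be strongly related in a nonlinear way and still have a coefficient close to 0.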
In an experimental context, you might care about several different types of correlated metrics:
Metric families: metrics that measure the same or similar phenomena, e.g. total spend per user, total revenue per buyer, and a 0/1 indicator for purchasing
Surrogate metrics: metrics that are a leading indicator of another metric, e.g. total spend in the first 7 days a user is active may be predictive of total spend in the following 6 months
Intrinsic metrics: metrics that tend to take certain values among a certain “type” of experimental unit, e.g. total spend in the first 7 days may be correlated with the median salary in a user’s zip code
Each of these can serve a different function in an experiment.
Experiments are not only helpful in determining what to do but also help determine why interventions are effective. Metric Families can help you more deeply understand the why behind changes in user behaviors in your product.
Let’s say that your change increased overall revenue per user. The mechanism for this change could be an increase in the share of users who make purchases or an increase in the amount most users spend. These metrics are correlated, but together they help flesh out the whole picture of what is happening behind a given experimental result.
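To make the revenue example concrete, here is a small sketch that splits revenue per user into a purchase rate and a revenue per buyer, using the identity revenue per user = purchase rate × revenue per buyer. The function name `metric_family` and the per-user spend values are hypothetical.

```python
def metric_family(spend_per_user):
    """Split revenue per user into purchase rate and revenue per buyer."""
    n_users = len(spend_per_user)
    buyers = [s for s in spend_per_user if s > 0]
    return {
        "purchase_rate": len(buyers) / n_users,
        "revenue_per_buyer": sum(buyers) / len(buyers) if buyers else 0.0,
        "revenue_per_user": sum(spend_per_user) / n_users,
    }

# Hypothetical per-user spend in each experiment arm.
control = [0, 0, 0, 0, 0, 0, 20, 20, 30, 30]
treatment = [0, 0, 0, 0, 20, 20, 20, 20, 30, 30]

c = metric_family(control)
t = metric_family(treatment)
for name in c:
    print(f"{name}: control={c[name]:.2f} treatment={t[name]:.2f}")
```

In this made-up data, revenue per user rises in the treatment arm because more users purchase, while revenue per buyer falls slightly. Reading the family together points to conversion, not basket size, as the mechanism.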
Often, the most important business metrics are calculated over a long period of time and can be hard to move over the duration of an experiment. Surrogate Metrics are used to predict the impact on a high-level, long-term business metric from data points that are known earlier in an experiment.
Let’s say that your change increased user spend in the first 7 days they’re active on the platform, which is positively correlated with user spend in the next 6 months. This may be a good predictive metric (or part of a more complicated prediction model) for estimating a longer-term impact.
Note that using Surrogate Metrics also requires accounting for prediction error when reading experimental results.
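One simple way to picture a surrogate is a straight-line fit from the early metric to the long-term metric on historical users, which then converts an observed short-term lift into a predicted long-term lift. The sketch below assumes a linear relationship and assumes the treatment affects 6-month spend only through 7-day spend. The helper `fit_line`, the historical data, and the observed lift are all hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation: how far individual predictions tend to miss.
    resid_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, resid_sd

# Hypothetical historical users: 7-day spend and the following 6-month spend.
hist_7d = [0.0, 5.0, 10.0, 20.0, 40.0]
hist_6m = [10.0, 30.0, 45.0, 90.0, 160.0]
slope, intercept, resid_sd = fit_line(hist_7d, hist_6m)

# Hypothetical experiment result: treatment raised mean 7-day spend by 2.0.
observed_lift_7d = 2.0
predicted_lift_6m = slope * observed_lift_7d
print(f"predicted 6-month lift = {predicted_lift_6m:.2f}, residual sd = {resid_sd:.2f}")
```

The residual standard deviation is one source of prediction error. The slope itself is also estimated with uncertainty, and both add to the usual sampling error of the experiment, so an interval around the predicted long-term lift should be wider than the interval around the 7-day lift alone.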
No one wants to make an experiment decision based on a false positive. Intrinsic metrics that are related at the unit level but aren’t impacted in the same way by an experimental treatment can serve as guardrails for detecting false positives.
In an AA test, where no real intervention has taken place, correlated metrics will tend to show false positives at the same time. This is why intrinsic metrics can make good guardrails.
Let’s say that your change increased overall revenue per user, and revenue per user is typically related to the median salary in their zip code. If your change also appears to have increased the median salary in a user’s zip code, that’s an indication that this result may be a false positive. That’s because there’s no reasonable mechanism for an online experiment treatment to cause a user to move to a new zip code or for their entire zip code’s salaries to increase.
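The AA-test behavior described above can be illustrated with a small simulation. This is a sketch under stated assumptions: both metrics are normally distributed and driven by a shared latent factor (giving a unit-level correlation of about 0.8), and each is tested with a two-sided z-test at the 5% level. The names `z_stat` and `simulate_aa` and all parameter values are hypothetical.

```python
import math
import random

def z_stat(a, b):
    """Two-sample z statistic for a difference in means."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def simulate_aa(n_sims=2000, n_users=200, seed=0):
    """Return false positive rates for revenue, salary, and both at once."""
    rng = random.Random(seed)
    half = n_users // 2
    fp_rev = fp_sal = fp_both = 0
    for _ in range(n_sims):
        revenue, salary = [], []
        for _ in range(n_users):
            latent = rng.gauss(0, 1)  # shared user "type"
            revenue.append(latent + rng.gauss(0, 0.5))
            salary.append(latent + rng.gauss(0, 0.5))
        # Users are i.i.d., so splitting by index is a random assignment
        # with no treatment effect (an AA test).
        rev_sig = abs(z_stat(revenue[:half], revenue[half:])) > 1.96
        sal_sig = abs(z_stat(salary[:half], salary[half:])) > 1.96
        fp_rev += rev_sig
        fp_sal += sal_sig
        fp_both += rev_sig and sal_sig
    return fp_rev / n_sims, fp_sal / n_sims, fp_both / n_sims

fp_revenue, fp_salary, fp_both = simulate_aa()
print(f"revenue: {fp_revenue:.3f}, salary: {fp_salary:.3f}, both: {fp_both:.3f}")
print(f"expected 'both' if independent: {fp_revenue * fp_salary:.4f}")
```

Each metric on its own should be falsely significant in roughly 5% of runs, but the two should be significant together far more often than the product of those rates would suggest. That co-movement is what makes a significant result on a metric the treatment cannot plausibly move a useful warning sign.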
Correlated metrics can be troublesome when you apply a multiple comparison correction. Many multiple comparison corrections, such as the Benjamini-Hochberg Procedure, are derived under the assumption that measurements are independent of each other. The Bonferroni Correction remains valid under any dependence between metrics, but it is calibrated for the worst case.
When metrics are positively correlated, multiple comparison corrections penalize the addition of more metrics more than necessary. These methods still provide an upper bound on the possible error rates; they just lose more statistical power than necessary.
Conversely, when metrics are negatively correlated, the error-rate guarantees of corrections that rely on independence or positive dependence, such as the Benjamini-Hochberg Procedure, may no longer hold.
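The loss of power under positive correlation can be seen in a short simulation of the Bonferroni Correction. This sketch assumes every null hypothesis is true and that the test statistics are standard normal with a common pairwise correlation `rho`, induced by a shared component. The function name `bonferroni_fwer` and the parameter values are hypothetical, and the sketch does not cover the negatively correlated case.

```python
import math
import random
from statistics import NormalDist

def bonferroni_fwer(rho, n_metrics=10, alpha=0.05, n_sims=20000, seed=0):
    """Estimate the familywise error rate of Bonferroni when all nulls are true."""
    rng = random.Random(seed)
    # Two-sided critical value at the Bonferroni-adjusted level alpha / n_metrics.
    critical = NormalDist().inv_cdf(1 - alpha / (2 * n_metrics))
    shared_w = math.sqrt(rho)
    own_w = math.sqrt(1 - rho)
    any_rejection = 0
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)
        # Each statistic mixes the shared component with its own noise,
        # so every pair of statistics has correlation rho.
        if any(abs(shared_w * shared + own_w * rng.gauss(0, 1)) > critical
               for _ in range(n_metrics)):
            any_rejection += 1
    return any_rejection / n_sims

fwer_independent = bonferroni_fwer(rho=0.0)
fwer_positive = bonferroni_fwer(rho=0.9)
print(f"independent metrics: FWER = {fwer_independent:.3f}")
print(f"positively correlated metrics: FWER = {fwer_positive:.3f}")
```

With independent metrics the estimated familywise error rate should sit close to the nominal 5%. With strongly positively correlated metrics it should fall well below that, which means the correction is stricter than it needs to be and true effects are harder to detect.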