You understand that failing to plan the test duration can lead to underpowered tests and inflated false positive rates due to peeking. Recently, you've been introduced to CUPED, an advanced statistical method that reduces KPI variance, resulting in more sensitive tests (lower MDE) or shorter test durations (lower sample size). You prefer the latter.
If this description fits you, you’re in the right place 😊
In this post we’ll learn how to use CUPED in the planning phase to reduce sample size while maintaining statistical power and MDE. Pretty wild, right?
Test planning is crucial in controlled experiments to ensure accurate and reliable results. Proper planning of sample size helps avoid underpowered tests, which can fail to detect real effects. It also reduces the risk of inflated false positive rates when not using sequential testing methods, because without a predetermined sample size you tend to peek at the results more than once. Here's the formula for calculating sample size for a t-test for two means:
$$ n = \left(\frac{\sigma \left(Z_{\alpha/2} + Z_{\beta}\right)}{\Delta}\right)^2 $$
n: Sample size for each group, assuming equal allocation
\( Z_{\alpha/2} \): Z-value for the desired significance level (assuming a two-tailed hypothesis)
\( Z_{\beta} \): Z-value for the desired power
\( \Delta \): Minimum Detectable Effect (MDE) in absolute terms, which is the difference between two means
\( \sigma \): Standard deviation
Notice that the required sample size is directly proportional to the variance, \(\sigma^2\). Therefore, reducing variance by 40% means you need 40% less sample size. For instance, if your original sample size is n and the reduced variance is 0.6\(\sigma^2\), then your new required sample size will be 0.6n.
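As a minimal sketch of the formula above (the σ and MDE values here are illustrative, not from the post):

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, mde, alpha=0.05, power=0.8):
    """Per-group sample size: n = (sigma * (Z_{alpha/2} + Z_beta) / MDE)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05, two-tailed
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return ceil((sigma * (z_alpha + z_beta) / mde) ** 2)

# Illustrative numbers: metric standard deviation 10, absolute MDE of 1
print(sample_size(sigma=10, mde=1))  # 785
```

Because n scales with \(\sigma^2\), halving the variance in this function halves the returned sample size (up to rounding).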
CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique designed to leverage pre-existing data to reduce variance in experiment KPIs, thereby enhancing test sensitivity. By adjusting for variability unrelated to the experiment, CUPED reduces noise in the data, making it easier to detect the true effect of the experiment.
To illustrate, let's consider a business metric \( \overline{Y} \), such as revenue. In a traditional t-test, we would compare the average revenue of the control group with that of the treatment group.
With CUPED, we introduce a new variable, \( \overline{Y}_{\text{CUPED}} \), by adjusting the original metric \( \overline{Y} \) using pre-existing data \( \overline{X} \). This data \( \overline{X} \) is scaled by a constant \( \theta \), which must be determined. Instead of comparing the average revenues in a t-test, we compare the \( \overline{Y}_{\text{CUPED}} \) values using a modified variance.
The intuition behind CUPED is that the total variance of the business metric \( \overline{Y} \) can be decomposed into two components: the portion attributed to the variance of the control variate \( \overline{X} \) and the portion explained by other unknown factors.
Here's the CUPED formula:
\[ \overline{Y}_{\text{CUPED}} = \overline{Y} - \theta \times \overline{X} \]
Where:
\(\overline{Y}\): The business metric average during the experiment
\(\overline{X}\): Pre-existing data (e.g., business metric average from a pre-experiment period)
\(\theta\): The constant. Variance reduction optimality is achieved when \(\theta = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\)
The variance of \(\overline{Y}_{\text{CUPED}}\) is reduced by the square of the Pearson correlation between \(\overline{X}\) and \(\overline{Y}\):
\[ \text{Var}\left(\overline{Y}_{\text{CUPED}}\right) = \text{Var}(\overline{Y})\left(1 - \rho^2\right) \]
Where \(\rho\) is the Pearson correlation coefficient. For instance, if \(\rho = 0.9\), variance is reduced by \(0.9^2 = 0.81\), i.e., by 81%. Reducing the sample size by 81% is a huge achievement, and it indeed happens in practice. The higher the correlation, the bigger the variance reduction. This technique is powerful, especially when testing on existing users with rich historical data.
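A quick simulation makes this concrete. The sketch below (simulated data, stdlib only) estimates \(\theta = \text{Cov}(X, Y)/\text{Var}(X)\) from a sample and verifies that applying the CUPED adjustment shrinks the variance by exactly the squared sample correlation:

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

random.seed(42)
# Simulated pre-experiment metric X and in-experiment metric Y that depends on X plus noise
x = [random.gauss(100, 20) for _ in range(50_000)]
y = [0.9 * xi + random.gauss(10, 10) for xi in x]

theta = cov(x, y) / cov(x, x)                        # optimal theta
y_cuped = [yi - theta * xi for xi, yi in zip(x, y)]  # CUPED-adjusted metric

rho = cov(x, y) / (cov(x, x) * cov(y, y)) ** 0.5
var_reduction = 1 - cov(y_cuped, y_cuped) / cov(y, y)
print(round(var_reduction, 3), round(rho ** 2, 3))   # the two numbers match
```

The match is not a coincidence: plugging the optimal \(\theta\) into \(\text{Var}(Y - \theta X)\) gives \(\text{Var}(Y)(1 - \rho^2)\) by construction.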
Example:
In reality, we don't know the true values of the variables, so we must estimate them.
Suppose you have KPI data for a website's page views. In the pre-experiment period, the average page views per user \(\overline{X}\) is 1000. During the experiment period, the page views per user \(\overline{Y}\) is 1100. The sample covariance between \(\overline{X}\) and \(\overline{Y}\) is 200, and the sample variance of \(\overline{X}\) is 250.
Estimate \(\theta\):
\[ \hat{\theta} = \frac{200}{250} = 0.8 \]
Using CUPED, the adjusted KPI for the experiment period would be:
\[ \overline{Y}_{\text{CUPED}} = 1100 - 0.8 \times 1000 = 300 \]
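The same arithmetic in a few lines of Python, using the numbers from the example:

```python
x_bar, y_bar = 1000, 1100   # pre-experiment and in-experiment average page views per user
cov_xy, var_x = 200, 250    # sample covariance of (X, Y) and sample variance of X

theta_hat = cov_xy / var_x              # 200 / 250 = 0.8
y_cuped = y_bar - theta_hat * x_bar     # 1100 - 0.8 * 1000 = 300.0
print(theta_hat, y_cuped)
```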
CUPED is particularly useful in situations where there is a lot of variability in the business metric that is not related to the treatment effect, but can be explained by pre-existing data. By using pre-existing data to adjust for this variability, CUPED can help to isolate the true effect of the treatment, leading to more reliable and accurate results. This is especially important in A/B testing, where the goal is to detect small differences between treatment and control groups.
There’s no definitive rule for determining when pre-existing data is sufficient for CUPED. In my view, a Pearson correlation above 0.5 (or lower than -0.5), which corresponds to a 25% reduction in sample size, serves as a good rule of thumb.
Moreover, CUPED can be used in conjunction with other experimental techniques to further improve the efficiency and accuracy of the test. For example, it can be combined with sequential testing, which also reduces test duration by sequentially analyzing the data and applying an optional stopping rule.
Another advantage of CUPED is that it can be implemented relatively easily in most experimental settings. The required pre-existing data is often readily available, and the calculations needed to implement CUPED are straightforward and can be performed using standard statistical software packages.
Integrating CUPED into test planning involves the following steps:
1. Gather pre-experiment data \(X\) for the same metric and estimate the Pearson correlation \(\rho\) between \(X\) and the experiment metric \(Y\).
2. Compute the reduced variance \(\sigma^2(1 - \rho^2)\).
3. Plug the reduced variance into the sample size formula. This is equivalent to multiplying the original sample size by \(1 - \rho^2\).
For example, if the original plan called for 1,000 users per group and \(\rho = 0.9\), then \(1000 \times (1 - 0.81) = 190\). Thus, the required sample size using CUPED is 190.
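This planning calculation can be sketched as follows (illustrative baseline of 1,000 users per group and \(\rho = 0.9\)):

```python
from math import ceil

n_original = 1000   # per-group sample size from the standard formula
rho = 0.9           # correlation estimated from pre-experiment data

# CUPED scales the required sample size by (1 - rho^2)
n_cuped = ceil(n_original * (1 - rho ** 2))
print(n_cuped)  # 190
```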
Planning is essential for accurate and reliable A/B testing, and CUPED is a powerful technique that can significantly reduce test durations by reducing variance. By understanding and applying CUPED during the planning phase, you can maintain statistical power and MDE while requiring fewer samples.
CUPED not only enhances the sensitivity of your tests but also helps in resource optimization by reducing the number of subjects needed for experiments. This makes it an invaluable tool in the arsenal of any data scientist or analyst involved in experimental design and analysis.
Today, CUPED is widely implemented in A/B testing platforms like Eppo and Statsig, further demonstrating its value in streamlining and improving the efficiency of experimental processes.
As you incorporate CUPED into your test planning, you'll find that your experiments become more efficient, reliable, and impactful.
And if you'd like to start implementing CUPED into your own experiments, create a free Statsig account and dive in.