Standard deviation (and its square, variance) are extremely common measurements used to understand how much variation, or “spread,” an observed variable has.
In short, a low standard deviation means that observations cluster around a dataset’s mean, and a high standard deviation means that observations are spread out in a wide range around a dataset’s mean.
The reason this is such a common and powerful measurement is that, in conjunction with the central limit theorem and a few assumptions, we can use standard deviation as a way to—at a high level—quantify the probability of a given observation set happening.
This lets us establish confidence intervals in polling and quality measurements, run A/B tests for hypothesis testing with quantifiable probabilities, assess risk in outcomes and markets, and better understand the world around us.
In the United States, the average height for adult men is 69 inches, with a standard deviation around 3 inches. Knowing these two numbers, we can estimate that only around 2.5% of men in the United States are taller than 75 inches (two standard deviations above the mean).
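As a quick sketch (assuming adult heights are approximately normally distributed), that estimate is one line of scipy:

```python
from scipy.stats import norm

mean_height = 69  # inches
sd_height = 3     # inches

# P(height > 75) under a normal approximation: 75 is two standard deviations above the mean
p_taller = norm.sf(75, loc=mean_height, scale=sd_height)
print(f"{p_taller:.3f}")  # ~0.023, i.e. roughly 2.5%
```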
Polling. If we ask random people whether they support a policy, we can calculate the variance of their 1/0 responses and construct an interval that we’re 95% confident contains the true population “support rate”.
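Here’s a minimal sketch of that interval using a normal approximation and some hypothetical 1/0 responses:

```python
import numpy as np

responses = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # hypothetical 1/0 "support" answers
p_hat = responses.mean()
se = np.sqrt(p_hat * (1 - p_hat) / len(responses))    # standard error of the proportion

# 95% confidence interval via the normal approximation (±1.96 standard errors)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
print(p_hat, ci)
```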
If we are producing a beverage and want to make sure the cans have 12 ounces of fluid, we can randomly sample 100 cans out of every 100,000 and measure the volume in each can. If the sample mean is 11.92 oz with a standard deviation of 0.05 oz, we can calculate that only about 1% of cans will contain 11.8 oz or less.
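That 1% figure falls straight out of the normal CDF; a quick sketch, assuming fill volumes are roughly normal:

```python
from scipy.stats import norm

# P(volume <= 11.8 oz) given a sample mean of 11.92 oz and standard deviation of 0.05 oz
p_underfilled = norm.cdf(11.8, loc=11.92, scale=0.05)
print(f"{p_underfilled:.4f}")  # ~0.008, i.e. roughly 1% of cans
```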
If we take a population of 10,000 users, split them evenly into two groups, and give half a treatment and half a placebo, we can use standard deviation to evaluate whether the treatment did anything. The way this works is we:
1. Assume there is no treatment effect
2. Measure our outcome metric
3. Calculate the standard deviations of the populations’ metric, and the difference in mean metric values between the two groups
4. Calculate the probability of observing that difference in means, given the standard deviation/spread of the population metric
5. If the probability is low (usually 5%, but in medicine this can be as low as 0.1%), we conclude that it’s unlikely we’d observe this effect due to chance, and there was likely a treatment effect
This is usually called a t-test!
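In code, steps 2 through 5 boil down to a two-sample t-test. A minimal sketch with simulated (hypothetical) outcome data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=5000)    # placebo group's metric values (simulated)
treatment = rng.normal(loc=10.1, scale=2.0, size=5000)  # treated group's metric values (simulated)

# Welch's t-test: probability of a difference in means this large if there were no true effect
t_stat, p_value = ttest_ind(treatment, control, equal_var=False)
print(t_stat, p_value)  # conclude a likely effect if p_value is below your significance level (e.g. 0.05)
```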
Variance is the average squared difference from the mean. For each observation, we subtract the mean, square the result, add all of those values up, and divide by the number of observations
Standard deviation is the square root of the variance in the population
Standard error is the standard deviation divided by the square root of the number of observations
Let's say this is our dataset:
4, 7, 3, 10
Measure | Calculation | Symbol | Value |
---|---|---|---|
iᵗʰ Observation | | xᵢ | 4, 7, 3, or 10 |
Population Size | | n | 4 |
Mean | Σxᵢ / n | μ | 24/4 = 6 |
Variance | Σ(xᵢ-μ)² / n | σ² | ((-2)² + 1² + (-3)² + 4²) / 4 = 30/4 = 7.5 |
Standard Deviation | √σ² | σ | √7.5 ≈ 2.74 |
Standard Error | σ/√n | sₓ or σₓ | 2.74/√4 ≈ 1.37 |
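The same calculation in numpy, using the population formulas above:

```python
import numpy as np

x = np.array([4, 7, 3, 10])

mean = x.mean()                      # 6.0
variance = ((x - mean) ** 2).mean()  # 7.5 (population variance, i.e. ddof=0)
std_dev = np.sqrt(variance)          # ~2.74
std_err = std_dev / np.sqrt(len(x))  # ~1.37
print(mean, variance, std_dev, std_err)
```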
Generally the true population standard deviation/variance is unknown, so we infer it from our sampled data and call it the Sample Standard Deviation (computed the same way, but dividing by n - 1 instead of n).
In all of the use cases above (and more), we are trying to use statistics to inform a decision, and standard deviation represents uncertainty.
A high standard deviation means we cannot make strong conclusions. For example, if the standard deviation in our a/b test was close to infinity, any observation would be reasonably likely. Conversely, if the standard deviation was near 0, any change in mean could reasonably be attributed to our experimental treatment.
Because of this, modern measurement tools seek to reduce standard deviation in order to allow people to make decisions with more certainty.
For the examples below, we’ll apply techniques to this dataset, representing user revenue data from a 50/50 a/b test where the test did lead to an increase in revenue.
Without any adjustments, the p-value is 0.106, meaning there’s a 10.6% chance of observing an effect at least this large by chance alone. Typically this isn’t strong enough evidence to conclude there was a real effect.
It’s well known that outliers can heavily skew mean values.
For example, if you have a population with 100 users, and 50 have a value of 4.5, 49 have a value of 5.5, and 1 has a value of 10,005, the average value is 105. The outlier has dominated the mean, since without them the population average is 5.
Similarly, the standard deviation of that dataset is 999.5; without the outlier user, it is 0.5. Since there is a squared term in variance, extreme values can blow up the measurement very easily.
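You can verify this directly; a quick numpy sketch of the example above:

```python
import numpy as np

values = np.array([4.5] * 50 + [5.5] * 49 + [10_005])  # 100 users, one extreme outlier

print(values.mean(), values.std())            # mean ~105, standard deviation near 1,000
print(values[:-1].mean(), values[:-1].std())  # without the outlier: mean ~5, standard deviation ~0.5
```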
Because of this, it’s very common to deal with outliers in a number of ways:
The easiest way to deal with outliers is to remove them from the dataset. However, this can cause issues since outliers do provide meaningful information. This is only recommended if you can confirm that the outliers are “bad” or “buggy” data.
The most common way to deal with outliers is winsorization. This practice involves identifying the metric value at the Nth percentile of your population and setting any values above it to that value. In the example above, the value at the 99th percentile is 5.5, so winsorizing at the 99th percentile would turn the 10,005 user into a 5.5.
Similarly to filtering, this does remove some information from the dataset - but in a less extreme way! Generally, for experimentation, we recommend winsorizing at 99.9%, which keeps most users in the analysis but prevents extreme tails.
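A minimal sketch of this one-sided winsorization with numpy (the data here is simulated, not our experiment dataset):

```python
import numpy as np

def winsorize_upper(values, percentile=99.9):
    """Cap every value at the given upper percentile of the sample."""
    cap = np.percentile(values, percentile)
    return np.minimum(values, cap)

rng = np.random.default_rng(0)
revenue = rng.exponential(scale=5.0, size=10_000)  # hypothetical user revenue
revenue[0] = 10_005                                # inject one extreme outlier

winsorized = winsorize_upper(revenue, 99.9)
print(revenue.std(), winsorized.std())  # the outlier no longer dominates the spread
```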
In the case above, 99% winsorization only reduced standard deviation by a small amount; that’s because this dataset is well-behaved and doesn’t have really extreme outliers. Our p-value went from 0.1059 to 0.0519 (which is still a notable improvement!)
Let’s try this on a dataset with extreme outliers - in this case almost certainly a logging bug.
We reduced our p-value from 0.62 to 0.51, with standard deviation going down by 91%.
Capping is another way to deal with outliers. This is similar to winsorization, but more manual; generally, you set a user-day value that you think is a reasonable cap, and set any values over the cap to the cap.
For example, an online book-seller might choose to set a cap of 10 books/day in their measurement to avoid their experiments being overly influenced by a collector or library purchasing 10,000 books in a day.
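In code, capping is just a clip at a fixed, hand-picked value; a sketch with hypothetical daily purchase counts:

```python
import numpy as np

books_per_day = np.array([1, 2, 0, 3, 10_000, 1])  # hypothetical user-day purchase counts
capped = np.minimum(books_per_day, 10)              # cap at 10 books/day
print(capped)  # [1 2 0 3 10 1]
```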
Let’s cap the outlier dataset at 100. This reduces standard deviation by 99%—which makes sense, since it’s similar to winsorizing and we used a more aggressive threshold.
CUPED—or regression adjustment—is a tool by which we explain away some of the variance in a dataset by using other explanatory variables besides “random chance”.
Without regression adjustment, we attribute everything we see in our population to random noise plus our treatment effect (in an a/b test). But not all of that variation is really random noise. For example, height is not totally random; your parents’ height is a very strong predictor of your adult height.
In that example, if we control for parent height, we’re taking out a “random variable” from our variance and will end up with less variance in our dataset:
Father’s height (in) | Adult Male Height (in) | Adjusted (naive subtraction) | Naive Variance | Adjusted Variance |
---|---|---|---|---|
63 | 65 | 2 | 7.84 | 2.64 |
75 | 73 | -2 | ||
70 | 72 | 2 | ||
70 | 69 | -1 | ||
68 | 69 | 1 | | |
In practice, this is a bit more complicated. Most notably, the explanatory variable here isn’t a perfect predictor, so we want to run a “regression” on the data points to build a better adjustment. In this case, a basic OLS regression would be h = 22.973 + 0.674·ph, where ph is the father’s height.
Father’s height (in) | Adult Male Height (in) | Adjusted (use residual, or value - prediction) | Naive Variance | Adjusted Variance |
---|---|---|---|---|
63 | 65 | -0.42 | 7.84 | 1.05 |
75 | 73 | -0.51 | ||
70 | 72 | 1.86 | ||
70 | 69 | -1.14 | ||
68 | 69 | 0.21 | | |
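A sketch of that adjustment in numpy, reproducing the numbers in the table above:

```python
import numpy as np

father = np.array([63, 75, 70, 70, 68])
height = np.array([65, 73, 72, 69, 69])

# Fit a simple OLS line: height ≈ intercept + slope * father
slope, intercept = np.polyfit(father, height, deg=1)
residuals = height - (intercept + slope * father)

print(intercept, slope)               # ~22.97 and ~0.674, matching the regression above
print(height.var(), residuals.var())  # ~7.84 -> ~1.05: variance drops once parent height is controlled for
```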
You can also use other variables. For example, parent income or country may provide additional information. If you’re curious, we recommend reading our article on CUPED or checking the resources at the end of this article.
In our example dataset, we see the following transformation, which reduces the standard deviation by 80%; the p-value goes from 0.1059 to <0.0001.
Thresholding is the practice of creating a 1/0 user-level flag based on whether a user’s metric value passes a certain threshold. This trades information detail for statistical power.
If we convert revenue to a 1/0 flag for if a user spent >$10 during our experiment, this reduces our p-value to 0.0026; we have very strong evidence for the hypothesis that more users spent at least $10!
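A sketch of thresholding with simulated (hypothetical) revenue data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
control_rev = rng.exponential(scale=8.0, size=5000)    # hypothetical per-user revenue, control
treatment_rev = rng.exponential(scale=9.0, size=5000)  # hypothetical per-user revenue, treatment

# Threshold: did the user spend more than $10 during the experiment?
control_flag = (control_rev > 10).astype(int)
treatment_flag = (treatment_rev > 10).astype(int)

t_stat, p_value = ttest_ind(treatment_flag, control_flag, equal_var=False)
print(p_value)
```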
Note that comparing standard deviation makes less sense in this case, since we’ve also shifted the mean; a fairer comparison is the coefficient of variation (standard deviation divided by the mean), which accounts for that shift.
Participation is simply a threshold metric with the threshold at 0. This is common since, intuitively, it just means that the user did an action at all. This includes metrics like “Purchases”, “Subscribers”, or other critical actions you want to know a user did at least once.
These can also be turned into WAU/DAU/Retention metrics—a “rolling” participation.
In the previous example, we get a p-value of 0.0022, down from 0.1059.
On Statsig, we typically encourage experimenters to use CUPED in conjunction with winsorization. Combining these, we get an extremely low p-value of 4.608e-09 and can clearly reject the null hypothesis that our treatment didn’t move revenue.
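Here’s a rough sketch of how the two combine, using simulated data and a hypothetical pre-experiment revenue covariate (real revenue data is heavier-tailed, so winsorization matters more in practice):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
n = 5000
pre = rng.exponential(scale=8.0, size=2 * n)               # hypothetical pre-experiment revenue (covariate)
post = 0.8 * pre + rng.exponential(scale=3.0, size=2 * n)  # experiment-period revenue, correlated with pre
post[n:] += 0.5                                            # simulated treatment effect on the second half

# 1) Winsorize at the 99.9th percentile to tame extreme outliers
cap = np.percentile(post, 99.9)
post_w = np.minimum(post, cap)

# 2) CUPED-style adjustment: remove the variance explained by the pre-experiment covariate
theta = np.cov(post_w, pre, ddof=0)[0, 1] / pre.var()
adjusted = post_w - theta * (pre - pre.mean())

control, treatment = adjusted[:n], adjusted[n:]
print(ttest_ind(treatment, control, equal_var=False).pvalue)  # much smaller than without the adjustment
```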
There are surprisingly few guides on this topic, which is critical in many fields.
If you’re interested in learning more, here are some helpful resources: