Standard deviation (and its square, variance) is an extremely common measurement used to understand how much variation, or “spread”, an observed variable has.
In short, a low standard deviation means that observations cluster around a dataset’s mean, and a high standard deviation means that observations are spread out in a wide range around a dataset’s mean.
The reason this is such a common and powerful measurement is that, in conjunction with the central limit theorem and a few assumptions, we can use standard deviation as a way to—at a high level—quantify the probability of a given set of observations happening.
This lets us establish confidence intervals in polling and quality measurements, run A/B tests for hypothesis testing with quantifiable probabilities, assess risk in outcomes and markets, and better understand the world around us.
In the United States, the average height for adult men is 69 inches, with a standard deviation of around 3 inches. Knowing these two numbers, we can estimate that only around 2.5% of men in the United States are taller than 75 inches (two standard deviations above the mean).
Polling. If we ask random people if they support a policy or not, we can calculate the variance of their 1/0 responses and construct an interval that we’re 95% confident the true population “support rate” lies within.
If we are producing a beverage and want to make sure the cans have 12 ounces of fluid, we can randomly sample 100 cans out of every 100,000 and measure the volume in each. If the standard deviation is 0.05 oz and the mean is 11.92 oz, we’d expect only about 1% of cans to contain 11.8 oz or less.
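To make these back-of-the-envelope calculations concrete, here’s a minimal sketch in Python using scipy. The means and standard deviations are the ones quoted above; the poll result of 520 “yes” responses out of 1,000 is an illustrative number, not from the examples.

```python
from scipy import stats

# Height: mean 69", sd 3" -- share of men taller than 75" (two standard deviations above the mean)
print(stats.norm.sf(75, loc=69, scale=3))           # ~0.023, i.e. roughly 2.5%

# Polling: 520 of 1,000 respondents say "yes" (illustrative numbers)
p_hat, n = 0.52, 1000
se = (p_hat * (1 - p_hat) / n) ** 0.5               # standard error of a proportion
print(p_hat - 1.96 * se, p_hat + 1.96 * se)         # ~95% confidence interval for the true support rate

# Fill volume: mean 11.92 oz, sd 0.05 oz -- share of cans at 11.8 oz or less
print(stats.norm.cdf(11.8, loc=11.92, scale=0.05))  # ~0.008, i.e. about 1%
```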
If we take a population of 10,000 users, split them into two equal groups, and give half a treatment and half a placebo, we can use standard deviation to evaluate if the treatment did anything. The way this works is we:
Assume there is no treatment effect
Measure our outcome metric
Calculate the standard deviations of the populations’ metric, and the difference in mean metric values between the two groups
Calculate the probability of observing that difference in means, given the standard deviation/spread of the population metric
If the probability is low (usually below 5%, though in medicine the bar can be as strict as 0.1%), we conclude that the effect is unlikely to be due to chance alone and that there was likely a real treatment effect
This practice is commonly called a/b testing or a randomized controlled trial, and is usually evaluated using a t-test.
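As a rough sketch, here’s what that comparison looks like with a two-sample t-test in Python. The group sizes, means, and spread below are made-up placeholder values, not data from a real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder outcome data: one metric value per user in each group
placebo = rng.normal(loc=10.0, scale=3.0, size=5000)
treatment = rng.normal(loc=10.2, scale=3.0, size=5000)

# Welch's t-test: probability of a difference in means this large
# if the treatment truly did nothing
t_stat, p_value = stats.ttest_ind(treatment, placebo, equal_var=False)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: unlikely to be chance alone")
else:
    print(f"p = {p_value:.4f}: not enough evidence of a treatment effect")
```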
Variance is the average of the squared differences from the mean. For each observation, we subtract the mean, square the result, add all of those values up, and divide by the number of observations
Standard deviation is the square root of the variance in the population
Standard error is the standard deviation divided by the square root of the number of observations
Let's say this is our dataset: 4, 7, 3, 10
Measure | Calculation | Symbol | Value |
---|---|---|---|
iᵗʰ Observation | | xᵢ | 4, 7, 3, or 10 |
Population Size | | n | 4 |
Mean | Σxᵢ / n | μ | 24/4 = 6 |
Variance | Σ(xᵢ-μ)² / n | σ² | ((-2)² + 1² + (-3)² + (4)²)/4 = 7.5 |
Standard Deviation | √σ² | σ | √7.5 = 2.74 |
Standard Error | σ/√n | sₓ or σₓ | 2.74/√4 = 1.37 |
Generally, the true population standard deviation/variance is unknown, so we infer it from our sampled data and call it the Sample Standard Deviation (which divides by n − 1 instead of n to correct for estimating the mean from the same sample)
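To tie the table back to code, here’s the same calculation in numpy; the sample standard deviation at the end divides by n − 1 rather than n.

```python
import numpy as np

data = np.array([4, 7, 3, 10])
n = len(data)

mean = data.mean()                      # 6.0
variance = ((data - mean) ** 2).mean()  # 7.5   (population variance, divides by n)
std_dev = variance ** 0.5               # ~2.74
std_err = std_dev / n ** 0.5            # ~1.37

sample_std = data.std(ddof=1)           # ~3.16 (sample standard deviation, divides by n - 1)
```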
In all of the use cases above (and more), we are trying to use statistics to inform a decision, and standard deviation represents uncertainty.
When standard deviation is high, we need to see an extreme result in order to form a conclusion. For example, if the standard deviation in our a/b test was close to infinity, any observation would be reasonably likely to occur by chance. Conversely, if the standard deviation was near 0, any change in mean could reasonably be attributed to our experimental treatment.
Because of this, modern measurement tools seek to reduce standard deviation in order to allow people to make decisions with more certainty.
For the examples below, we’ll apply techniques to this dataset, representing user revenue data from a 50/50 a/b test where the test did lead to an increase in revenue.
Without any adjustments, the p-value is 0.106, meaning there’s a 10.6% chance of observing an effect this large or larger if the treatment actually had no effect. Typically this isn’t strong enough evidence to conclude there was a real effect.
It’s well known that outliers can heavily skew mean values.
For example, if you have a population with 100 users, and 50 have a value of 4.5, 49 have a value of 5.5, and 1 has a value of 10,005, the average value is 105. The outlier has dominated the mean, since without them the population average is 5.
Similarly, the standard deviation of that dataset is roughly 1,000; without the outlier user, it is 0.5. Since variance contains a squared term, extreme values are even more influential in its calculation.
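A quick sketch reproducing those numbers on the toy population above:

```python
import numpy as np

# 50 users at 4.5, 49 at 5.5, one extreme outlier at 10,005
values = np.array([4.5] * 50 + [5.5] * 49 + [10_005.0])

print(values.mean())           # ~105   -- the outlier dominates the mean
print(values.std(ddof=1))      # ~1000  -- and dominates the spread even more

no_outlier = values[values < 10_000]
print(no_outlier.mean())       # ~5.0
print(no_outlier.std(ddof=1))  # ~0.5
```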
Because of this, it’s very common to deal with outliers in a number of ways:
The easiest way to deal with outliers is to remove them from the dataset. However, this is usually not a good solution, since they do provide meaningful information. This is only recommended if you can confirm that the outliers are “bad” or “buggy” data.
The most common way to deal with outliers is winsorization. This practice involves identifying the metric value at the Nth percentile of your population, and setting any values over that to that value. In the example above, the 98th percentile would be 5.5, so we’d just turn the 10,005 user into a 5.5 if winsorizing at 98%.
Similarly to filtering, this does remove some information from the dataset, but in a less extreme way. Generally, for experimentation, we recommend winsorizing at 99.9%, which keeps most users at their original metric value, but prevents extreme tails.
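A minimal sketch of one-sided winsorization with numpy, applied to the toy outlier population from earlier; the helper name `winsorize` is just for illustration.

```python
import numpy as np

def winsorize(values: np.ndarray, percentile: float = 99.9) -> np.ndarray:
    """Cap every value above the given percentile at that percentile's value."""
    cap = np.percentile(values, percentile)
    return np.minimum(values, cap)

# The outlier population from earlier, winsorized at the 98th percentile
values = np.array([4.5] * 50 + [5.5] * 49 + [10_005.0])
print(winsorize(values, 98).max())  # 5.5 -- the 10,005 is pulled down to the 98th-percentile value
```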
In the case above, we only reduced standard deviation by a small amount with 98% winsorization; that’s because this dataset is well behaved and doesn’t have truly extreme outliers. Our p-value went from 0.1059 to 0.0519 (which is still a notable improvement!)
Let’s try this on a dataset with extreme outliers - in this case almost certainly a logging bug.
We reduced our p-value from 0.62 to 0.51, with standard deviation going down by 91%.
Capping is another way to deal with outliers. This is similar to winsorization, but more manual; generally, you set a user-day value that you think is a reasonable cap, and set any values over the cap to the cap.
For example, an online book-seller might choose to set a cap of 10 books/day in their measurement to avoid their experiments being overly influenced by a collector or library purchasing 10,000 books in a day.
Let’s cap the outlier dataset at 100. This reduces standard deviation by 99%—which makes sense, since it’s similar to winsorizing and we used a more aggressive threshold.
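Capping looks almost identical in code; the only difference is that the cap is chosen by hand instead of computed from a percentile. The values below are illustrative, not the dataset from this article.

```python
import numpy as np

def cap_values(values: np.ndarray, cap: float) -> np.ndarray:
    """Set any value above the manually chosen cap to the cap itself."""
    return np.minimum(values, cap)

# e.g. cap illustrative per-user values at 100 units
print(cap_values(np.array([3.0, 7.0, 12.0, 54_000.0]), cap=100))  # [  3.   7.  12. 100.]
```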
CUPED—or regression adjustment—is a tool by which we explain away some of the variance in a dataset by using other explanatory variables besides “random chance”.
Without regression adjustment, we attribute what we see in our population purely to random noise. This isn’t true, though. For example, height is not totally random; your parents’ height is a very strong predictor of your adult height.
In that example, if we control for parent height, we’re taking out a “random variable” from our variance and will end up with less variance in our dataset:
Father’s height (in) | Adult Male Height (in) | Adjusted (naive subtraction) | Naive Variance | Adjusted Variance |
---|---|---|---|---|
63 | 65 | 2 | 7.84 | 2.64 |
75 | 73 | -2 | ||
70 | 72 | 2 | ||
70 | 69 | -1 | ||
68 | 69 | 1 |
In practice, this is a bit more complicated. Most notably, the explanatory variable here isn’t a perfect predictor, so we can run a regression on the data points to build a better adjustment. In this case, a basic OLS regression gives h = 22.973 + 0.674·ph, where ph is the father’s height and h is the predicted adult height.
Father’s height (in) | Adult Male Height (in) | Adjusted (use residual, or value - prediction) | Naive Variance | Adjusted Variance |
---|---|---|---|---|
63 | 65 | -0.42 | 7.84 | 1.05 |
75 | 73 | -0.51 | ||
70 | 72 | 1.86 | ||
70 | 69 | -1.14 | ||
68 | 69 | 0.21 |
You can also use other variables. For example, parent income or country may provide additional information. If you’re curious, we recommend reading our article on CUPED or checking the resources at the end of this article.
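Here’s a minimal sketch of that adjustment on the five father/son pairs from the table, using ordinary least squares via numpy. In a real experiment, CUPED typically uses each user’s pre-experiment metric value as the covariate rather than a demographic variable.

```python
import numpy as np

father = np.array([63, 75, 70, 70, 68], dtype=float)
son = np.array([65, 73, 72, 69, 69], dtype=float)

# Fit a simple OLS regression: son ≈ intercept + slope * father
slope, intercept = np.polyfit(father, son, deg=1)
print(intercept, slope)   # ~22.97, ~0.674

# The residuals are the "adjusted" values: what's left after the covariate explains its share
residuals = son - (intercept + slope * father)

print(son.var())          # 7.84  -- naive variance
print(residuals.var())    # ~1.05 -- variance after adjustment
```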
In our example dataset, we see the following transformation, which reduces the standard deviation by 80%; the p-value goes from 0.1059 to <0.0001.
Thresholding is the practice of creating 1/0 user-level flags if their metric value passes a certain threshold. This trades information detail for power.
If we convert revenue to a 1/0 flag for whether a user spent more than $10 during our experiment, this reduces our p-value to 0.0026; we have very strong evidence that the treatment increased the share of users spending more than $10!
Note that comparing standard deviation makes less sense in this case, since we’ve also shifted the mean.
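A sketch of thresholding with placeholder data; the revenue arrays below are randomly generated stand-ins, not the example dataset from this article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder per-user revenue for control (a) and treatment (b)
revenue_a = rng.exponential(scale=8.0, size=5000)
revenue_b = rng.exponential(scale=9.0, size=5000)

# Threshold at $10: did this user spend more than $10 during the experiment?
flag_a = (revenue_a > 10).astype(float)
flag_b = (revenue_b > 10).astype(float)

# Compare the share of users over the threshold in each group
t_stat, p_value = stats.ttest_ind(flag_b, flag_a, equal_var=False)
print(flag_a.mean(), flag_b.mean(), p_value)
```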
Participation is simply a threshold metric with the threshold at 0. This is common since, intuitively, it just means that the user did an action at all. This includes metrics like “Purchases”, “Subscribers”, or any other critical action you care whether a user did at least once.
These can also be turned into WAU/DAU/Retention metrics—a “rolling” participation.
Applied to the same example dataset, a participation metric gives a p-value of 0.0022, down from 0.1059.
On Statsig, we typically encourage experimenters to use CUPED in conjunction with winsorization. Combining these, we get an extremely low p-value of 4.608e-09 and can clearly reject the null hypothesis that our treatment didn’t move revenue.
There are surprisingly few guides on this topic, which is critical in many fields.
If you’re interested in learning more, here are some helpful resources: