Understanding (and reducing) variance and standard deviation

Fri Jan 17 2025

Craig Sexauer

Data Scientist, Statsig

If numbers could gossip, standard deviation would be the one spilling all the secrets.

Standard deviation and its square, variance, are extremely common measurements used to understand how much variation, or “spread,” an observed variable has.

In short, a low standard deviation means that observations cluster around a dataset’s mean, and a high standard deviation means that observations are spread out over a wide range around a dataset’s mean.

The reason this is such a common and powerful measurement is that, in conjunction with the central limit theorem and a few assumptions, we can use standard deviation as a way to—at a high level—quantify the probability of a given observation set happening.

This lets us establish confidence intervals in polling and quality measurements, run A/B tests for hypothesis testing with quantifiable probabilities, assess risk in outcomes and markets, and better understand the world around us.

Applications

  • In the United States, the average height for adult men is 69 inches, with a standard deviation around 3 inches. Knowing these two numbers, we can estimate that only around 2.5% of men are taller than 75 inches (two standard deviations) in the United States.

  • Polling. If we ask random people whether they support a policy, we can calculate the variance of their 1/0 responses and construct an interval that we’re 95% confident contains the true population “support rate.”

  • If we are producing a beverage and want to make sure the cans have 12 ounces of fluid, we can randomly sample 100 cans out of every 100,000 and measure the volume in each can. If the standard deviation is 0.05 oz and the mean is 11.92 oz, we can calculate that we expect only about 1% of cans to have 11.8 oz or less.

  • If we take a population of 10,000 users, split them into even groups, and give half a treatment and half a placebo, we can use standard deviation to evaluate if the treatment did anything. The way this works is we:

    • Assume there is no treatment effect

    • Measure our outcome metric

    • Calculate the standard deviation of each group’s metric, and the difference in mean metric values between the two groups

    • Calculate the probability of observing that difference in means, given the standard deviation/spread of the population metric

    • If the probability is low (usually below 5%, though in medicine the bar can be as low as 0.1%), we conclude that this effect is unlikely to have arisen by chance, and that there was likely a treatment effect

    This is usually called a t-test!
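As a concrete sketch, here’s what that test looks like in Python on simulated data (all of the numbers below are made up for illustration):

```python
# Minimal two-sample t-test sketch on simulated data (numbers are illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
placebo = rng.normal(loc=10.0, scale=3.0, size=5000)   # control group's metric
treated = rng.normal(loc=10.2, scale=3.0, size=5000)   # treatment group's metric

# Welch's t-test compares the observed difference in means against the
# spread (standard deviation) of each group.
t_stat, p_value = stats.ttest_ind(treated, placebo, equal_var=False)
print(f"difference in means: {treated.mean() - placebo.mean():.3f}")
print(f"p-value: {p_value:.4f}")  # below 0.05 -> unlikely to be chance alone
```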

Calculations

  • Variance is the average of the squared differences from the mean. For each observation, we subtract the mean, square the result, then average those squared values across the dataset

  • Standard deviation is the square root of the variance in the population

  • Standard error is the standard deviation divided by the square root of the number of observations

Let's say this is our dataset:

4, 7, 3, 10

| Measure | Calculation | Symbol | Value |
| --- | --- | --- | --- |
| iᵗʰ observation | | xᵢ | 4, 7, 3, or 10 |
| Population size | | n | 4 |
| Mean | Σxᵢ / n | μ | 24/4 = 6 |
| Variance | Σ(xᵢ − μ)² / n | σ² | ((−2)² + 1² + (−3)² + 4²) / 4 = 30/4 = 7.5 |
| Standard deviation | √σ² | σ | √7.5 ≈ 2.74 |
| Standard error | σ/√n | sₓ or σₓ | 2.74/√4 ≈ 1.37 |

Generally the true population standard deviation/variance is unknown, so we infer it from our sampled data and call it the Sample Standard Deviation (computed with n − 1 in the denominator rather than n, a tweak known as Bessel’s correction).
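Here’s the same arithmetic as a quick Python check (NumPy’s ddof argument switches between the population and sample formulas):

```python
# Reproducing the table above for the dataset [4, 7, 3, 10].
import numpy as np

x = np.array([4, 7, 3, 10])
n = len(x)

mean = x.mean()                          # 24 / 4 = 6.0
variance = ((x - mean) ** 2).sum() / n   # population variance: 30 / 4 = 7.5
std_dev = np.sqrt(variance)              # ~2.74
std_err = std_dev / np.sqrt(n)           # ~1.37

# Treating the data as a sample from a larger population, divide by n - 1
# instead (Bessel's correction):
sample_std = x.std(ddof=1)               # ~3.16
```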

Reducing standard deviation

In all of the use cases above (and more), we are trying to use statistics to inform a decision, and standard deviation represents uncertainty.

A high standard deviation means we cannot make strong conclusions. For example, if the standard deviation in our A/B test was close to infinity, any observation would be reasonably likely. Conversely, if the standard deviation was near 0, any change in mean could reasonably be attributed to our experimental treatment.

Because of this, modern measurement tools seek to reduce standard deviation in order to allow people to make decisions with more certainty.

For the examples below, we’ll apply techniques to this dataset, representing user revenue data from a 50/50 A/B test where the treatment did lead to an increase in revenue.

Without any adjustments, the p-value is 0.106, meaning there’s a 10.6% chance of observing an effect at least this large purely by chance if the treatment did nothing. Typically this isn’t strong enough evidence to conclude there was a real effect.

[Figure: total revenue distribution by group in raw data]

Outlier management

It’s well known that outliers can heavily skew mean values.

For example, if you have a population with 100 users, where 50 have a value of 4.5, 49 have a value of 5.5, and 1 has a value of 10,005, the average value is about 105. The outlier has dominated the mean, since without it the population average is about 5.

Similarly, the standard deviation of that dataset is roughly 1,000; without the outlier user, it is 0.5. Since there is a squared term in variance, extreme values can blow up the measurement very easily.
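You can verify that arithmetic quickly with NumPy:

```python
# Checking the outlier arithmetic above.
import numpy as np

values = np.array([4.5] * 50 + [5.5] * 49 + [10_005.0])
print(values.mean())             # ~105: the single outlier dominates the mean
print(values.std(ddof=1))        # ~1,000: and blows up the standard deviation
print(values[:-1].mean())        # ~5.0 once the outlier is removed
print(values[:-1].std(ddof=1))   # ~0.5 once the outlier is removed
```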

Because of this, it’s very common to deal with outliers in a number of ways:

Filtering

The easiest way to deal with outliers is to remove them from the dataset. However, this can cause issues since outliers do provide meaningful information. This is only recommended if you can confirm that the outliers are “bad” or “buggy” data.

Winsorization

The most common way to deal with outliers is winsorization. This practice involves identifying the metric value at the Nth percentile of your population and setting any values above it to that value. In the example above, the 99th percentile would be 5.5, so winsorizing at the 99th percentile would turn the 10,005 value into a 5.5.

Similarly to filtering, this does remove some information from the dataset - but in a less extreme way! Generally, for experimentation, we recommend winsorizing at 99.9%, which keeps most users in the analysis but prevents extreme tails.
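Here’s a minimal winsorization sketch in Python, assuming your per-user metric values live in a NumPy array (note that NumPy interpolates percentiles, so the exact cap can differ a little from the convention used above):

```python
import numpy as np

def winsorize_upper(values: np.ndarray, percentile: float = 99.9) -> np.ndarray:
    """Cap every value above the given percentile at that percentile's value."""
    cap = np.percentile(values, percentile)
    return np.minimum(values, cap)

revenue = np.array([4.5] * 50 + [5.5] * 49 + [10_005.0])  # outlier example from above
print(revenue.std(ddof=1))                       # ~1,000: dominated by the outlier
print(winsorize_upper(revenue, 99).std(ddof=1))  # ~10: the outlier no longer dominates
```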

[Figure: total revenue distribution by group in adjusted data]

In the case above, we only reduced standard deviation by a small amount with 99% winsorization; that’s because this dataset is well-behaved and doesn’t have really extreme outliers. Our p-value went from 0.1059 to 0.0519 (which is still a notable improvement!).

Let’s try this on a dataset with extreme outliers - in this case almost certainly a logging bug.

[Figure: total revenue distribution by group in raw data]
[Figure: total revenue by group in adjusted data]

We reduced our p-value from 0.62 to 0.51, with standard deviation going down by 91%.

Capping

Capping is another way to deal with outliers. It’s similar to winsorization, but more manual: you pick a per-user-day value that you think is a reasonable maximum, and set any values over that cap to the cap.

For example, an online book-seller might choose to set a cap of 10 books/day in their measurement to avoid their experiments being overly influenced by a collector or library purchasing 10,000 books in a day.
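A quick sketch of that kind of cap (the purchase counts here are made up):

```python
import numpy as np

books_per_user_day = np.array([1, 2, 0, 3, 10_000])  # one library-sized purchase
capped = np.minimum(books_per_user_day, 10)          # -> [1, 2, 0, 3, 10]
```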

Let’s cap the outlier dataset at 100. This reduces standard deviation by 99%—which makes sense, since it’s similar to winsorizing and we used a more aggressive threshold.

[Figure: total revenue distribution by group in adjusted data]

CUPED (Experimentation)

CUPED—or regression adjustment—is a tool by which we explain away some of the variance in a dataset by using other explanatory variables besides “random chance”.

Without regression adjustment, we attribute what we see in our population to random noise, plus our treatment effect (in an a/b test). This isn’t true, though. For example, height is not totally random; your parents’ height is a very strong predictor of your adult height.

In that example, if we control for parent height, we’re taking out a “random variable” from our variance and will end up with less variance in our dataset:

| Father’s height (in) | Adult Male Height (in) | Adjusted (naive subtraction) | Naive Variance | Adjusted Variance |
| --- | --- | --- | --- | --- |
| 63 | 65 | 2 | 7.84 | 2.64 |
| 75 | 73 | −2 | | |
| 70 | 72 | 2 | | |
| 70 | 69 | −1 | | |
| 68 | 69 | 1 | | |

In practice, this is a bit more complicated. Most notably, the explanatory variable here isn’t a perfect predictor, so we run a regression on the data points to build a better adjustment. In this case, a basic OLS regression gives h = 22.973 + 0.674 × (father’s height).

| Father’s height (in) | Adult Male Height (in) | Adjusted (residual: value − prediction) | Naive Variance | Adjusted Variance |
| --- | --- | --- | --- | --- |
| 63 | 65 | −0.42 | 7.84 | 1.05 |
| 75 | 73 | −0.51 | | |
| 70 | 72 | 1.86 | | |
| 70 | 69 | −1.14 | | |
| 68 | 69 | 0.21 | | |
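The adjustment in that table can be reproduced with a simple least-squares fit; here’s a sketch using NumPy’s polyfit:

```python
import numpy as np

father = np.array([63, 75, 70, 70, 68], dtype=float)
son = np.array([65, 73, 72, 69, 69], dtype=float)

slope, intercept = np.polyfit(father, son, deg=1)  # ~0.674 and ~22.97
residuals = son - (intercept + slope * father)     # the "Adjusted" column above

print(son.var())        # naive variance: 7.84
print(residuals.var())  # adjusted variance: ~1.05
```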

You can also use other variables. For example, parent income or country may provide additional information. If you’re curious, we recommend reading our article on CUPED or checking the resources at the end of this article.

In our example dataset, we see the following transformation, which reduces the standard deviation by 80%; the p-value goes from 0.1059 to <0.0001.

[Figure: total revenue distribution by group in adjusted data]

Transformations

Thresholds

Thresholding is the practice of creating a 1/0 user-level flag indicating whether a user’s metric value passes a certain threshold. This trades information detail for power.

If we convert revenue to a 1/0 flag for if a user spent >$10 during our experiment, this reduces our p-value to 0.0026; we have very strong evidence for the hypothesis that more users spent at least $10!

Note that comparing standard deviations makes less sense in this case, since we’ve also shifted the mean. A better comparison is the coefficient of variation (standard deviation divided by the mean), which accounts for that shift.
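Here’s a minimal sketch of the threshold transformation (the revenue values are hypothetical); the resulting 0/1 flags can then be compared between groups with the same kind of t-test sketched earlier, or with a two-proportion test:

```python
import numpy as np

def spent_over(revenue: np.ndarray, threshold: float = 10.0) -> np.ndarray:
    """Convert per-user revenue into a 1/0 flag for exceeding the threshold."""
    return (revenue > threshold).astype(int)

revenue = np.array([0.0, 3.5, 12.0, 48.0, 7.2])
print(spent_over(revenue))  # [0 0 1 1 0]
```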

[Figure: total revenue distribution by group in adjusted data]

Participation

Participation is simply a threshold metric with the threshold at 0. This is common since, intuitively, it just means that the user did an action. This includes metrics like “Purchases”, “Subscribers”, or other critical actions you might care that a user did at least once.

These can also be turned into WAU/DAU/Retention metrics—a “rolling” participation.

In the previous example, we get a p-value of 0.0022, down from 0.1059.

[Figure: total revenue distribution by group in adjusted data]

Combining Methods

On Statsig, we typically encourage experimenters to use CUPED in conjunction with winsorization. Combining these, we get an extremely low p-value of 4.608e-09 and can clearly reject the null hypothesis that our treatment didn’t move revenue.

Where to learn more

There are surprisingly few guides on this topic, which is critical in many fields.

If you’re interested in learning more, here are some helpful resources:
