Standard deviation (and its square, variance) is an extremely common measurement used to understand how much variation, or “spread”, an observed variable has.
In short, a low standard deviation means that observations cluster around a dataset’s mean, and a high standard deviation means that observations are spread out in a wide range around a dataset’s mean.
The reason this is such a common and powerful measurement is that, in conjunction with the central limit theorem and a few assumptions, we can use standard deviation as a way to—at a high level—quantify the probability of a given set of observations happening.
This lets us establish confidence intervals in polling and quality measurements, run A/B tests for hypothesis testing with quantifiable probabilities, assess risk in outcomes and markets, and better understand the world around us.
In the United States, the average height for adult men is 69 inches, with a standard deviation of around 3 inches. Knowing these two numbers, we can estimate that only around 2.5% of men in the United States are taller than 75 inches (two standard deviations above the mean).
Polling. If we ask random people if they support a policy or not, we can calculate the variance of their 1/0 responses and construct an interval that we’re 95% confident the true population “support rate” lies within.
If we are producing a beverage and want to make sure the cans have 12 ounces of fluid, we can randomly sample 100 cans out of every 100,000 and measure the volume in each. If the standard deviation is 0.05 oz and the mean is 11.92 oz, we’d expect only about 1% of cans to contain 11.8 oz or less.
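To make these back-of-the-envelope calculations concrete, here’s a minimal sketch in Python using scipy. The means and standard deviations are the ones quoted above; the poll result of 520 “yes” responses out of 1,000 is an illustrative number, not from the examples.

```python
from scipy import stats

# Height: mean 69", sd 3" -- share of men taller than 75" (two standard deviations above the mean)
print(stats.norm.sf(75, loc=69, scale=3))           # ~0.023, i.e. roughly 2.5%

# Polling: 520 of 1,000 respondents say "yes" (illustrative numbers)
p_hat, n = 0.52, 1000
se = (p_hat * (1 - p_hat) / n) ** 0.5               # standard error of a proportion
print(p_hat - 1.96 * se, p_hat + 1.96 * se)         # ~95% confidence interval for the true support rate

# Fill volume: mean 11.92 oz, sd 0.05 oz -- share of cans at 11.8 oz or less
print(stats.norm.cdf(11.8, loc=11.92, scale=0.05))  # ~0.008, i.e. about 1%
```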
If we take a population of 10,000 users, split them into two equal groups, and give half a treatment and half a placebo, we can use standard deviation to evaluate if the treatment did anything. The way this works is we:
Assume there is no treatment effect
Measure our outcome metric
Calculate the standard deviations of the populations’ metric, and the difference in mean metric values between the two groups
Calculate the probability of observing that difference in means, given the standard deviation/spread of the population metric
If the probability is low (usually below 5%, though in medicine the bar can be as strict as 0.1%), we conclude that the effect is unlikely to be due to chance alone and that there was likely a real treatment effect
This practice is commonly called a/b testing or a randomized controlled trial, and is usually evaluated using a t-test.
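As a rough sketch, here’s what that comparison looks like with a two-sample t-test in Python. The group sizes, means, and spread below are made-up placeholder values, not data from a real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder outcome data: one metric value per user in each group
placebo = rng.normal(loc=10.0, scale=3.0, size=5000)
treatment = rng.normal(loc=10.2, scale=3.0, size=5000)

# Welch's t-test: probability of a difference in means this large
# if the treatment truly did nothing
t_stat, p_value = stats.ttest_ind(treatment, placebo, equal_var=False)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: unlikely to be chance alone")
else:
    print(f"p = {p_value:.4f}: not enough evidence of a treatment effect")
```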
Variance is the average of the squared differences from the mean. For each observation, we subtract the mean, square the result, add all of those values up, and divide by the number of observations
Standard deviation is the square root of the variance in the population
Standard error is the standard deviation divided by the square root of the number of observations
Let's say this is our dataset: 4, 7, 3, 10
Measure | Calculation | Symbol | Value |
---|---|---|---|
iᵗʰ Observation | | xᵢ | 4, 7, 3, or 10 |
Population Size | | n | 4 |
Mean | Σxᵢ / n | μ | 24/4 = 6 |
Variance | Σ(xᵢ-μ)² / n | σ² | ((-2)² + 1² + (-3)² + (4)²)/4 = 7.5 |
Standard Deviation | √σ² | σ | √7.5 = 2.74 |
Standard Error | σ/√n | sₓ or σₓ | 2.74/√4 = 1.37 |
Generally, the true population standard deviation/variance is unknown, so we infer it from our sampled data and call it the Sample Standard Deviation (which divides by n − 1 instead of n to correct for estimating the mean from the same sample)
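To tie the table back to code, here’s the same calculation in numpy; the sample standard deviation at the end divides by n − 1 rather than n.

```python
import numpy as np

data = np.array([4, 7, 3, 10])
n = len(data)

mean = data.mean()                      # 6.0
variance = ((data - mean) ** 2).mean()  # 7.5   (population variance, divides by n)
std_dev = variance ** 0.5               # ~2.74
std_err = std_dev / n ** 0.5            # ~1.37

sample_std = data.std(ddof=1)           # ~3.16 (sample standard deviation, divides by n - 1)
```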
In all of the use cases above (and more), we are trying to use statistics to inform a decision, and standard deviation represents uncertainty.
When standard deviation is high, we need to see an extreme result in order to form a conclusion. For example, if the standard deviation in our a/b test was close to infinity, any observation would be reasonably likely to occur by chance. Conversely, if the standard deviation was near 0, any change in mean could reasonably be attributed to our experimental treatment.
Because of this, modern measurement tools seek to reduce standard deviation in order to allow people to make decisions with more certainty.
For the examples below, we’ll apply techniques to this dataset, representing user revenue data from a 50/50 a/b test where the test did lead to an increase in revenue.
Without any adjustments, the p-value is 0.106, meaning there’s a 10.6% chance of observing an effect this large or larger if the treatment actually had no effect. Typically this isn’t strong enough evidence to conclude there was a real effect.
It’s well known that outliers can heavily skew mean values.
For example, if you have a population with 100 users, and 50 have a value of 4.5, 49 have a value of 5.5, and 1 has a value of 10,005, the average value is 105. The outlier has dominated the mean, since without them the population average is 5.
Similarly, the standard deviation of that dataset is roughly 1,000; without the outlier user, it is 0.5. Since variance contains a squared term, extreme values are even more influential in its calculation.
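A quick sketch reproducing those numbers on the toy population above:

```python
import numpy as np

# 50 users at 4.5, 49 at 5.5, one extreme outlier at 10,005
values = np.array([4.5] * 50 + [5.5] * 49 + [10_005.0])

print(values.mean())           # ~105   -- the outlier dominates the mean
print(values.std(ddof=1))      # ~1000  -- and dominates the spread even more

no_outlier = values[values < 10_000]
print(no_outlier.mean())       # ~5.0
print(no_outlier.std(ddof=1))  # ~0.5
```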
Because of this, it’s very common to deal with outliers in a number of ways:
The easiest way to deal with outliers is to remove them from the dataset. However, this is usually not a good solution, since they do provide meaningful information. This is only recommended if you can confirm that the outliers are “bad” or “buggy” data.
The most common way to deal with outliers is winsorization. This practice involves identifying the metric value at the Nth percentile of your population, and setting any values over that to that value. In the example above, the 98th percentile would be 5.5, so we’d just turn the 10,005 user into a 5.5 if winsorizing at 98%.
Similarly to filtering, this does remove some information from the dataset, but in a less extreme way. Generally, for experimentation, we recommend winsorizing at 99.9%, which keeps most users at their original metric value, but prevents extreme tails.
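A minimal sketch of one-sided winsorization with numpy, applied to the toy outlier population from earlier; the helper name `winsorize` is just for illustration.

```python
import numpy as np

def winsorize(values: np.ndarray, percentile: float = 99.9) -> np.ndarray:
    """Cap every value above the given percentile at that percentile's value."""
    cap = np.percentile(values, percentile)
    return np.minimum(values, cap)

# The outlier population from earlier, winsorized at the 98th percentile
values = np.array([4.5] * 50 + [5.5] * 49 + [10_005.0])
print(winsorize(values, 98).max())  # 5.5 -- the 10,005 is pulled down to the 98th-percentile value
```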
In the case above, we only reduced standard deviation by a small amount with 98% winsorization; that’s because this dataset is well behaved and doesn’t have truly extreme outliers. Our p-value went from 0.1059 to 0.0519 (which is still a notable improvement!)
Let’s try this on a dataset with extreme outliers - in this case almost certainly a logging bug.
We reduced our p-value from 0.62 to 0.51, with standard deviation going down by 91%.
Capping is another way to deal with outliers. This is similar to winsorization, but more manual; generally, you set a user-day value that you think is a reasonable cap, and set any values over the cap to the cap.
For example, an online book-seller might choose to set a cap of 10 books/day in their measurement to avoid their experiments being overly influenced by a collector or library purchasing 10,000 books in a day.
Let’s cap the outlier dataset at 100. This reduces standard deviation by 99%—which makes sense, since it’s similar to winsorizing and we used a more aggressive threshold.
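Capping looks almost identical in code; the only difference is that the cap is chosen by hand instead of computed from a percentile. The values below are illustrative, not the dataset from this article.

```python
import numpy as np

def cap_values(values: np.ndarray, cap: float) -> np.ndarray:
    """Set any value above the manually chosen cap to the cap itself."""
    return np.minimum(values, cap)

# e.g. cap illustrative per-user values at 100 units
print(cap_values(np.array([3.0, 7.0, 12.0, 54_000.0]), cap=100))  # [  3.   7.  12. 100.]
```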
CUPED—or regression adjustment—is a tool by which we explain away some of the variance in a dataset by using other explanatory variables besides “random chance”.
Without regression adjustment, we attribute what we see in our population purely to random noise. This isn’t true, though. For example, height is not totally random; your parents’ height is a very strong predictor of your adult height.
In that example, if we control for parent height, we’re taking out a “random variable” from our variance and will end up with less variance in our dataset:
Father’s height (in) | Adult Male Height (in) | Adjusted (naive subtraction) | Naive Variance | Adjusted Variance |
---|---|---|---|---|
63 | 65 | 2 | 7.84 | 2.64 |
75 | 73 | -2 | ||
70 | 72 | 2 | ||
70 | 69 | -1 | ||
68 | 69 | 1 |
In practice, this is a bit more complicated. Most notably, the explanatory variable here isn’t a perfect predictor, so we can run a regression on the data points to build a better adjustment. In this case, a basic OLS regression gives h = 22.973 + 0.674·ph, where ph is the father’s height and h is the predicted adult height.
Father’s height (in) | Adult Male Height (in) | Adjusted (use residual, or value - prediction) | Naive Variance | Adjusted Variance |
---|---|---|---|---|
63 | 65 | -0.42 | 7.84 | 1.05 |
75 | 73 | -0.51 | ||
70 | 72 | 1.86 | ||
70 | 69 | -1.14 | ||
68 | 69 | 0.21 |
You can also use other variables. For example, parent income or country may provide additional information. If you’re curious, we recommend reading our article on CUPED or checking the resources at the end of this article.
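Here’s a minimal sketch of that adjustment on the five father/son pairs from the table, using ordinary least squares via numpy. In a real experiment, CUPED typically uses each user’s pre-experiment metric value as the covariate rather than a demographic variable.

```python
import numpy as np

father = np.array([63, 75, 70, 70, 68], dtype=float)
son = np.array([65, 73, 72, 69, 69], dtype=float)

# Fit a simple OLS regression: son ≈ intercept + slope * father
slope, intercept = np.polyfit(father, son, deg=1)
print(intercept, slope)   # ~22.97, ~0.674

# The residuals are the "adjusted" values: what's left after the covariate explains its share
residuals = son - (intercept + slope * father)

print(son.var())          # 7.84  -- naive variance
print(residuals.var())    # ~1.05 -- variance after adjustment
```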
In our example dataset, we see the following transformation, which reduces the standard deviation by 80%; the p-value goes from 0.1059 to <0.0001.
Thresholding is the practice of creating 1/0 user-level flags if their metric value passes a certain threshold. This trades information detail for power.
If we convert revenue to a 1/0 flag for whether a user spent more than $10 during our experiment, this reduces our p-value to 0.0026; we have very strong evidence that the treatment increased the share of users spending more than $10!
Note that comparing standard deviation makes less sense in this case, since we’ve also shifted the mean.
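A sketch of thresholding with placeholder data; the revenue arrays below are randomly generated stand-ins, not the example dataset from this article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder per-user revenue for control (a) and treatment (b)
revenue_a = rng.exponential(scale=8.0, size=5000)
revenue_b = rng.exponential(scale=9.0, size=5000)

# Threshold at $10: did this user spend more than $10 during the experiment?
flag_a = (revenue_a > 10).astype(float)
flag_b = (revenue_b > 10).astype(float)

# Compare the share of users over the threshold in each group
t_stat, p_value = stats.ttest_ind(flag_b, flag_a, equal_var=False)
print(flag_a.mean(), flag_b.mean(), p_value)
```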
Participation is simply a threshold metric with the threshold at 0. This is common since, intuitively, it just means that the user did an action at all. This includes metrics like “Purchases”, “Subscribers”, or any other critical action you care whether a user did at least once.
These can also be turned into WAU/DAU/Retention metrics—a “rolling” participation.
Applied to the same example dataset, a participation metric gives a p-value of 0.0022, down from 0.1059.
On Statsig, we typically encourage experimenters to use CUPED in conjunction with winsorization. Combining these, we get an extremely low p-value of 4.608e-09 and can clearly reject the null hypothesis that our treatment didn’t move revenue.
There are surprisingly few guides on this topic, which is critical in many fields.
If you’re interested in learning more, here are some helpful resources: