Tim Chan
Lead Data Scientist, Statsig

How to calculate statistical significance

Tue Feb 04 2025

You’ve run an A/B test and the results are in, now what?

You’ve got the data and now you have to analyze the results. Your goal: Determine whether A is different from B (a classic two-sided hypothesis test).

However, real data is noisy and you’ll need to determine whether the observed differences are due to a real underlying difference or just statistical noise. Computing statistical significance is how we typically do this.

Statistical significance is a check that the results you’re seeing are not due to randomness and that there is a real difference between A and B. It’s a critical concept in hypothesis testing, which applies statistical guardrails to ensure you’re not making decisions based on random noise.

What is hypothesis testing?

In hypothesis testing, we have a pair of hypotheses called the null and the alternate hypothesis. The null hypothesis is simply:

  • In a two-sided test: There is no difference between A and B, or

  • In a one-sided test: B (Test) is not better than A (Control).

The alternate hypothesis is just the opposite: There is a difference between A and B, or that B is better than A. Because I don’t want to keep duplicating my words, I’m only going to refer to two-sided tests going forward.

Hypothesis testing is how we determine which hypothesis is correct.

We want to collect data and then determine whether we can reject the null hypothesis. If we can, then we accept the only hypothesis left standing, the alternate hypothesis.

The reason we take this convoluted approach is that scientifically and mathematically, it’s easier to model the null hypothesis and prove something is weird. Modelling the alternate hypothesis is particularly challenging, mostly because it isn’t clear how different B is from A (How much? Which direction? What does the distribution look like?).

Understanding statistical significance

The bar for having sufficient evidence to reject the null hypothesis is called statistical significance. Your result is either statistically significant or it's not, which allows us to make an equally binary decision: Do we reject the null hypothesis or not?

Key concepts: P-value and confidence interval

There are two other concepts we need to be familiar with: p-value and confidence interval.

P-value is the probability of seeing a difference (between A and B) at least as extreme as the one observed, assuming the null hypothesis (A is the same as B) is correct. A common misconception is that the p-value is the probability that the null hypothesis is correct. This is wrong, and a topic covered extensively outside of this article.

A low p-value, however, does indicate the observed difference is unlikely under the null hypothesis. And if the p-value is lower than our pre-determined threshold for statistical significance (eg, alpha = 0.05), we can reject the null hypothesis.

This lets us accept the alternate hypothesis and conclude there must actually be a difference between A and B.

Calculating statistical significance

To calculate the p-value, we need to compute the appropriate test statistic, such as a Z-score or T-statistic. Which one to use will depend on your data type and sample size. To test a null hypothesis like "there is no difference between A and B," we’ll want to compute the observed difference between A and B, commonly called the delta.

We’ll also want to know the standard error for this difference in order to get a sense of its accuracy and statistical variability. A common method is to compute the pooled standard deviation of A and B, and then derive the standard error.
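As a rough illustration, here’s a minimal Python sketch of that step, assuming two independent samples of a continuous metric with equal variances (the sample arrays `a` and `b` are made up for the example):

```python
import numpy as np

# Hypothetical samples of a continuous metric for Control (A) and Test (B)
a = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.3])
b = np.array([11.0, 12.4, 11.8, 13.0, 12.2, 11.6])

n_a, n_b = len(a), len(b)
delta = b.mean() - a.mean()  # observed difference between B and A

# Pooled standard deviation (assumes both groups share the same variance)
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
pooled_std = np.sqrt(pooled_var)

# Standard error of the difference, derived from the pooled standard deviation
std_err = pooled_std * np.sqrt(1 / n_a + 1 / n_b)

print(delta, std_err)
```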

With the delta and standard error, you’ll be able to compute the Z-score or T-statistic (the delta divided by its standard error). These values map to a corresponding p-value.

To determine whether the result is statistically significant, we’ll compare the p-value with our significance threshold (ie, alpha). If the p-value is less than alpha, we deem the results statistically significant. Otherwise, they’re not.
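Putting those steps together, here’s a hedged end-to-end sketch for a binomial metric like conversion rate, using a two-proportion Z-test (the conversion counts and group sizes are made up, and `scipy` is assumed to be available):

```python
from scipy.stats import norm

# Hypothetical conversion data: conversions / users in each group
conv_a, n_a = 200, 5_000   # Control (A)
conv_b, n_b = 250, 5_000   # Test (B)

p_a, p_b = conv_a / n_a, conv_b / n_b
delta = p_b - p_a  # observed difference

# Pooled conversion rate under the null hypothesis (A is the same as B)
p_pool = (conv_a + conv_b) / (n_a + n_b)
std_err = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5

z = delta / std_err                  # Z-score
p_value = 2 * norm.sf(abs(z))        # two-sided p-value

alpha = 0.05
print(f"delta={delta:.4f}, z={z:.2f}, p={p_value:.4f}")
print("statistically significant" if p_value < alpha else "not statistically significant")
```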

Factors influencing statistical significance

Sample size directly impacts the reliability of your test results. Larger samples generally provide more reliable data, reducing the margin of error.

Standard deviation also impacts the reliability and precision of our data. It’s a measure of the variability of our data, and larger variability means it’ll be harder to accurately measure A and B. Metrics that are binomial (eg, conversion rate) tend to have a lower standard deviation and are commonly used in experimentation.
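For intuition, here’s a tiny sketch of why binomial metrics are relatively low-variance: the per-user standard deviation of a conversion-style metric is sqrt(p * (1 - p)), which can never exceed 0.5 (the rates below are illustrative):

```python
import numpy as np

for p in [0.01, 0.05, 0.25, 0.5]:
    # per-user standard deviation of a binomial (conversion-style) metric
    print(p, np.sqrt(p * (1 - p)))
```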

Effect size is the magnitude of the difference. A substantial effect size makes the findings not only more meaningful, but also easier to detect.

To succeed in hypothesis testing, you’ll generally want scenarios that have a large effect size, large sample size, and small standard deviation.
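Those three levers show up directly in the test statistic. A small illustrative sketch (the numbers are made up) of how a larger delta, a larger sample, or a smaller standard deviation each push the Z-score up:

```python
import numpy as np

def z_score(delta, std, n_per_group):
    # Z-score for a difference of means with equal group sizes and equal variance
    std_err = std * np.sqrt(2 / n_per_group)
    return delta / std_err

print(z_score(delta=0.5, std=5.0, n_per_group=1_000))   # baseline
print(z_score(delta=1.0, std=5.0, n_per_group=1_000))   # larger effect size -> larger Z
print(z_score(delta=0.5, std=5.0, n_per_group=4_000))   # larger sample -> larger Z
print(z_score(delta=0.5, std=2.5, n_per_group=1_000))   # smaller standard deviation -> larger Z
```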
