Tue Oct 22 2024

William Sealy Gosset, a chemist and statistician at the Guinness brewery, wanted to estimate the quality of the company’s beer, but was concerned that existing statistical methods would be unreliable due to the small sample sizes available to him.

Fortunately, Gosset was not one to give up in the face of a challenge. After some research, he devised a solution that became one of the most famous statistical tools: the t-test.

It is often stressed that for a t-test to be valid, the data should follow a normal distribution. However, this assumption can pose challenges in the context of A/B testing; you need only a quick look at the distribution of some KPIs, such as revenue, to figure out that it is far from normal.

Confronted with this clear violation of the normality assumption, many analysts are uncertain about the best course of action. This article aims to clarify the issue by explaining why deviations from normality are often not a significant concern in A/B testing, outlining the pros and cons of using the t-test in such scenarios, and exploring alternative methods to the traditional t-test.

First, let’s start by understanding the significance of the assumption of normality.

Bear in mind that statisticians aim to draw meaningful insights about an entire population, but they can only base their conclusions on data from a sample. A familiar example is election polling: statisticians want to predict the outcome for all voters but must rely on a sample of survey responses.

This gap between the population and the sample creates a situation where statisticians can never be 100% certain; instead, they can only limit the probability that they are making errors.

Errors in a t-test stem directly from the concept of statistical inference as a proof by contradiction. In A/B testing, for instance, statisticians compare the KPI of two groups. They begin with the null hypothesis that there is no difference between the groups, rejecting this hypothesis only if the data presents strong evidence against it. To minimize errors, analysts typically set a limit on the probability of rejecting the null hypothesis incorrectly (referred to as alpha or Type I error rate), usually capping it at 5% or 10%.

Here’s where the normality assumption becomes critical: when this assumption holds, statisticians can ensure that alpha remains within the desired threshold. However, when the normality assumption is violated, the alpha value may inflate, leading to a higher likelihood of incorrectly concluding that there is an effect when, in reality, no effect exists.

If you're feeling a bit unsure, let's clarify things with a simulation. We’ll simulate a two-sample t-test with 50 observations per group.

Keep in mind that this is a relatively small sample size in A/B testing settings. We’ll examine a scenario where there’s no actual difference between the two groups—meaning both groups’ KPIs follow the same distribution. We'll explore two cases, illustrated in the graph below:

Normal distribution: the KPI follows a normal distribution.

Non-normal distribution: the KPI has a non-normal distribution.

We’ll set the significance level (alpha) at 5% and perform the t-test 100 times to observe how often the null hypothesis is rejected. To get a more reliable estimate, we’ll repeat this entire process 100 times and average the rejection rates. Since there’s no real difference between the two groups, we expect the null hypothesis to be rejected about 5% of the time, in line with the chosen alpha level.
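The simulation described above can be sketched in a few lines of Python. This is a minimal version using NumPy and SciPy; the specific distributions (a normal KPI versus a log-normal one, a common stand-in for skewed metrics like revenue) are illustrative choices, and the exact rejection rates will differ slightly from run to run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def rejection_rate(sampler, n=50, alpha=0.05, n_tests=100, n_repeats=100):
    """Average share of t-tests that reject H0 when both groups
    are drawn from the same distribution (i.e., H0 is true)."""
    rates = []
    for _ in range(n_repeats):
        rejections = 0
        for _ in range(n_tests):
            a, b = sampler(n), sampler(n)  # no true difference between groups
            _, p = stats.ttest_ind(a, b)
            rejections += p < alpha
        rates.append(rejections / n_tests)
    return float(np.mean(rates))

normal = lambda n: rng.normal(loc=1.0, scale=1.0, size=n)
skewed = lambda n: rng.lognormal(mean=0.0, sigma=1.0, size=n)  # a non-normal KPI

print(rejection_rate(normal))  # close to the nominal 5%
print(rejection_rate(skewed))  # can drift away from 5% at small n
```

Increasing `n` from 50 to 1,000 in the calls above reproduces the second experiment, where the rejection rate settles near 5% in both cases.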

In the normal distribution case, the averaged rejection rate is 5.07%, which aligns well with the statistical expectation. However, in the non-normal distribution case, we see a slight inflation in the alpha level, with the averaged rejection rate reaching 5.75%.

For fun, let’s run the simulation again with a larger sample size of 1,000 observations (instead of 50). Interestingly, with the larger sample size, the average rejection rate is 4.94% in both the normal and non-normal cases.

This suggests that alpha inflation disappears with larger samples. Fantastic!

The quick simulation in the previous section highlights the relevance of the normality assumption for small-sized datasets.

However, data availability has changed dramatically since the good old days of Gosset at Guinness. Back then, statistical methods were tailored to work with limited data, but today, many A/B testing scenarios involve large sample sizes. This shift is a game changer when it comes to the necessity of the normality assumption.

To understand the importance of sample size in a t-test, it's helpful to dive into the underlying statistical theory.

A variable that follows a t-distribution is essentially the ratio of two random variables: the numerator is normally distributed, while the denominator is the square root of an independent chi-squared variable divided by its degrees of freedom.

With this in mind, it's easier to see why the t-test statistic follows a t-distribution when the KPI has a normal distribution. Recall that the t-test statistic is:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

Since the mean of a normally distributed variable itself follows a normal distribution, and the sample variance of a normal variable (suitably scaled) follows a chi-squared distribution, a normal KPI meets the necessary conditions for the t-test statistic to follow a t-distribution.
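To make the formula concrete, here is the statistic computed by hand and checked against SciPy's implementation (using the equal-variance pooled form shown above; the sample data is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 2.0, size=80)
x2 = rng.normal(10.5, 2.0, size=90)

n1, n2 = len(x1), len(x2)
# Pooled standard deviation S_p (equal-variance form from the formula above)
sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
             / (n1 + n2 - 2))
t_manual = (x1.mean() - x2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))

t_scipy, p = stats.ttest_ind(x1, x2, equal_var=True)
print(t_manual, t_scipy)  # the two values agree
```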

However, a normally distributed KPI is not the only way to ensure that the t-test statistic follows a known distribution. Key statistical principles, such as the Central Limit Theorem and Slutsky’s Theorem, show that as the sample size increases, the t statistic converges toward a normal distribution.

Since the t-distribution itself approaches the normal distribution as the sample size grows, convergence of the t-statistic to a normal distribution is effectively equivalent to convergence to a t-distribution. Thus, the t-test remains valid for large samples, even if the KPI itself is not normally distributed.

But what qualifies as a large sample? Is it 100 observations, 1,000, or perhaps a million? According to Kohavi, a sample is considered large enough for a t-test to be valid if it reaches a sample size of:

\[ N \geq 355 \cdot \left( \frac{\overline{(x - \bar{x})^3}}{s_x^3} \right)^2 \]

where the term in parentheses is the sample skewness of the KPI.
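This rule of thumb is easy to evaluate on your own data. A minimal sketch, assuming a log-normal "revenue" KPI purely for illustration:

```python
import numpy as np

def min_sample_size(x):
    """Kohavi's rule of thumb: N >= 355 * skewness^2."""
    x = np.asarray(x, dtype=float)
    skewness = np.mean((x - x.mean()) ** 3) / x.std(ddof=0) ** 3
    return 355 * skewness ** 2

rng = np.random.default_rng(1)
revenue = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # heavily skewed KPI
symmetric = rng.normal(size=100_000)

print(min_sample_size(revenue))    # thousands of observations needed
print(min_sample_size(symmetric))  # tiny: skewness is near zero
```

For symmetric data the required sample size is negligible, while a skewed revenue-like metric can require thousands of observations per group.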

So far, we've demonstrated that the t-test remains valid for large samples, even when the KPI's distribution is not perfectly normal. But does this mean the t-test should always be the default method in such cases? Not necessarily.

While the t-test is robust and often reliable, even when the assumption of normality is violated, it may not always be the most powerful tool for detecting differences between groups. In some situations—especially when the data shows significant skewness or contains outliers—other methods may have a greater ability to detect an effect if one exists.

Two main alternatives are worth considering in such cases. One option is to use **non-parametric tests**, which do not rely on the assumption of normality. The non-parametric counterpart to the t-test is the Mann-Whitney U test, which ranks all observations and compares the average rank between groups.

While this method can indicate whether one group tends to have higher values than the other, it doesn’t directly assess central measures. Under certain conditions, such as when both groups' distributions have a similar shape, the Mann-Whitney test may allow conclusions about the median, but not about the mean as the t-test does.
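Running a Mann-Whitney U test takes one call in SciPy. The skewed groups below are illustrative; note that, per the caveat above, a small p-value here says one group tends to have higher values, not that the means differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.lognormal(mean=0.0, sigma=1.0, size=500)
treatment = rng.lognormal(mean=0.4, sigma=1.0, size=500)  # shifted on the log scale

# Ranks all observations and compares the rank sums between groups
u, p = stats.mannwhitneyu(control, treatment, alternative="two-sided")
print(p)  # small p-value: treatment tends to have higher values
```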

Another alternative is to **use bootstrapping to simulate the distribution of the difference** between the two means.

The concept behind bootstrapping is simple: treat the sample groups as if they represent the entire population, then repeatedly resample from these groups and calculate the mean difference for each resample. This process generates a distribution of differences, which can be used to estimate the difference with a confidence interval.

The key advantage of bootstrapping is that it does not rely on any assumptions about the underlying distribution. However, it has some limitations: it can be computationally intensive and is typically more suited for constructing confidence intervals rather than for formal hypothesis testing.
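The resampling loop described above can be sketched as follows; the log-normal samples and the number of resamples are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
control = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)
treatment = rng.lognormal(mean=0.3, sigma=1.0, size=2_000)

n_boot = 5_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement, then record the difference in means
    a = rng.choice(control, size=len(control), replace=True)
    b = rng.choice(treatment, size=len(treatment), replace=True)
    diffs[i] = b.mean() - a.mean()

# 95% percentile bootstrap confidence interval for the difference in means
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo, hi)  # an interval excluding 0 suggests a real difference
```

If the interval excludes zero, the data is inconsistent with "no difference between the means", without any distributional assumption.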

These two alternatives are particularly relevant when the distribution of the KPI is non-normal and unknown. However, in fortunate cases where the distribution of your KPI is known, more specific tests can be applied to compare two means. For example, count data—such as the number of emails received in a day or customer complaints in a week—often follows a Poisson distribution.

In such cases, a test designed to compare the means of Poisson-distributed data can be used. The key advantage of this approach is that tests tailored to the specific distribution of the KPI tend to be more powerful than the t-test.
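One classical way to compare two Poisson means is the conditional test: under the null hypothesis of equal rates, the first group's total count, given the combined total, follows a binomial distribution. This is a sketch under assumed daily complaint counts, not the only valid Poisson test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Daily complaint counts for two groups, each observed for 200 days
control = rng.poisson(lam=4.0, size=200)
treatment = rng.poisson(lam=5.0, size=200)

k1, k2 = int(control.sum()), int(treatment.sum())
n1, n2 = len(control), len(treatment)

# Conditional test: under H0 (equal rates), k1 | (k1 + k2) is binomial
# with success probability n1 / (n1 + n2)
result = stats.binomtest(k1, n=k1 + k2, p=n1 / (n1 + n2))
print(result.pvalue)
```

Because this test exploits the Poisson structure of the counts, it tends to detect rate differences with less data than a generic t-test would need.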

To put it simply: in most modern A/B tests, where sample sizes are large, you likely don’t need to worry too much about the normality assumption, even when your data is clearly not normally distributed.

While this assumption was critical when the t-test was first developed, modern statistical theory and simulations have shown that the t-test is quite robust, especially with large samples.

However, it’s important to note that the t-test may not always be the most powerful option when normality is violated. Alternatives like non-parametric tests or bootstrapping may yield more accurate results but can come with trade-offs, such as reduced interpretability and higher computational demands.

Finally, if the KPI’s distribution is known, a tailored parametric test suited to that specific distribution may be a more powerful solution.
