The claim that you can never "accept" the null hypothesis has confused many people, even seasoned statisticians, about the nature of hypothesis testing.

In this article, let's clear up the confusion by exploring:

1) Where this misunderstanding comes from,

2) Why people believe you shouldn't accept the null hypothesis, and

3) Why and when you actually should accept it.

The root cause is that modern textbooks often mix Fisher’s significance testing with Neyman-Pearson’s hypothesis testing, without explaining the key differences (ref).

It’s like blending the rules of two different sports. Can you touch the ball with your hands? It depends—are you playing basketball or soccer? Saying it’s illegal to touch the ball with your hands in basketball makes no sense, but that’s essentially what happens when people declare “accepting the null hypothesis” as wrong under the hypothesis testing framework.

So, can you "accept" the null hypothesis? In Fisher’s framework, no. But in Neyman-Pearson’s framework, yes—you can, and you must.

Three arguments are commonly given for why you shouldn't accept the null hypothesis. First, we can never definitively "prove" a hypothesis based on observations alone, because any given observation is compatible with infinitely many hypotheses, each of which assigns that observation a different probability.

Second, in the strict Fisher p-value framework, there is no alternative hypothesis. While we may set a threshold for rejecting the null hypothesis (e.g., p < 0.05), there is no similarly clear rule for what we should “accept” when we fail to reject it. This contrasts with the Neyman-Pearson framework, which specifies an alternative hypothesis and a corresponding beta (the probability of a Type II error).

Third, Fisher’s original stance was that we shouldn't assume the null hypothesis is true with complete certainty. In fact, he didn’t support the idea of a fixed significance level (like 0.05) but instead suggested that the p-value should be seen as a continuous measure of evidence against the null hypothesis.
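To make the "continuous measure of evidence" idea concrete, here is a minimal sketch of a two-sided z-test in which only the null hypothesis is specified. The function name `fisher_p_value` and all of the numbers (null mean 0, known standard deviation 1, sample size 100) are illustrative assumptions, not taken from any real experiment.

```python
from math import sqrt
from statistics import NormalDist


def fisher_p_value(sample_mean, null_mean, sigma, n):
    """Two-sided p-value for a z-test with known sigma.

    Only the null hypothesis (mean == null_mean) enters the calculation;
    no alternative hypothesis, beta, or power is defined anywhere.
    """
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical observed means: the further from the null, the smaller the p-value.
for observed in (0.05, 0.15, 0.25):
    p = fisher_p_value(observed, null_mean=0.0, sigma=1.0, n=100)
    print(f"observed mean {observed:.2f} -> p = {p:.4f}")
```

Note that nothing in this sketch returns "accept" or "reject". The output is a number to be weighed as evidence against the null, which is the reading of the p-value described above.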

In sum, the dangers of using the term "accepting a hypothesis" in the p-value framework are:

1) Many people mistakenly interpret "accepting" the null as "proving" it, which is incorrect.

2) "Accepting the null hypothesis" isn't a rigorously defined concept and overlooks the core purpose of the test, which is to decide whether to reject the null hypothesis given our observations.

Therefore, within Fisher’s p-value framework, "accepting the null hypothesis" is essentially invalid. But note that **terms like alternative hypothesis, alpha, beta, power, and minimum detectable effect (MDE) are also out of place in this context.** Anyone who uses these terms while declaring it illegal to accept a hypothesis is being inconsistent.

In the Neyman-Pearson framework, "accepting" a hypothesis, whether the null or the alternative, is not only allowed but necessary. However, **"accepting" a hypothesis doesn’t mean you believe it; it simply means you act as though it’s true.**

In hypothesis testing, remember how we can outline the null and alternative hypotheses and precisely calculate alpha and beta? This process isn't possible unless we temporarily assume one of the hypotheses is true.
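As a rough illustration of that point, the sketch below designs a one-sided z-test between two simple hypotheses. The critical value is computed by assuming the null is true, and beta is computed by assuming the alternative is true. The helper names `np_test_design` and `decide`, and the numbers (means 0 and 0.3, standard deviation 1, sample size 100, alpha 0.05), are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def np_test_design(mu0, mu1, sigma, n, alpha):
    """Design a one-sided test of H0: mean = mu0 vs. H1: mean = mu1 (mu1 > mu0)."""
    se = sigma / sqrt(n)
    # Assume H0 is true: choose the cutoff the sample mean exceeds with probability alpha.
    critical = NormalDist(mu0, se).inv_cdf(1 - alpha)
    # Assume H1 is true: beta is the probability the sample mean still falls at or below the cutoff.
    beta = NormalDist(mu1, se).cdf(critical)
    return critical, beta


def decide(sample_mean, critical):
    """The decision rule always ends by accepting one of the two hypotheses."""
    return "accept H1" if sample_mean > critical else "accept H0"


critical, beta = np_test_design(mu0=0.0, mu1=0.3, sigma=1.0, n=100, alpha=0.05)
print(f"critical value = {critical:.4f}, beta = {beta:.4f}, power = {1 - beta:.4f}")
print(decide(0.10, critical))
print(decide(0.20, critical))
```

Each probability here is conditional on one hypothesis being treated as true, and the rule's output is an action, not a statement of belief.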

The Neyman-Pearson framework, often referred to as the hypothesis testing framework, is a mathematically consistent approach for linking observations, hypotheses, and decision rules, all while calculating the relevant probabilities. I have a 15-minute video that walks you through this framework with helpful visualizations. I highly recommend watching it if you have been confused by textbooks.

If this is your first time learning that these two are different frameworks, please forget what you’ve learned for a minute and treat these two frameworks as basketball and soccer. Below is a comparison between these two frameworks.

**Fisher's framework:**

- **Focus on significance testing:** Fisher introduced the concept of significance testing, where the primary goal is to assess whether the observed data provide strong enough evidence to reject the null hypothesis.
- **Null hypothesis as a default assumption:** In Fisher's approach, the null hypothesis ($H_0$) represents a default position that there is no effect or no difference. It is a specific hypothesis that is tested without necessarily specifying an alternative hypothesis.
- **Use of p-values:** The p-value is central in Fisher's framework. It measures the probability of observing data as extreme as (or more extreme than) the observed data, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against $H_0$.
- **No fixed significance level:** Fisher did not advocate for a fixed significance level (like 0.05) but suggested that the p-value should be interpreted as a continuous measure of evidence against $H_0$.

**Neyman-Pearson framework:**

- **Focus on hypothesis testing as decision-making:** Neyman and Pearson formalized hypothesis testing as a decision-making process between two competing hypotheses: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$).
- **Null and alternative hypotheses treated symmetrically:** Both $H_0$ and $H_1$ are explicitly defined, and tests are designed to decide between them based on the data.
- **Control of error rates:** The framework introduces Type I error (rejecting $H_0$ when it is true) and Type II error (failing to reject $H_0$ when $H_1$ is true). Significance levels (alpha) and power (1 - beta) are predetermined to control these error rates.
- **Use of critical regions:** Instead of p-values, the Neyman-Pearson approach uses critical values and regions to decide whether to reject $H_0$, based on the likelihood ratio or test statistic.

**Purpose and interpretation:**

- *Fisher:* The null hypothesis is a provisional assumption to be tested. It is not necessarily meant to be accepted or rejected definitively but used to measure the strength of evidence against it.
- *Neyman-Pearson:* The null hypothesis is one of two competing hypotheses, and the testing procedure is designed to make a clear decision to accept or reject $H_0$ based on controlled error probabilities.

**Role of the alternative hypothesis:**

- *Fisher:* The alternative hypothesis is often implicit or not formally specified. The focus is on assessing evidence against $H_0$.
- *Neyman-Pearson:* The alternative hypothesis ($H_1$) is explicitly defined, and tests are constructed to distinguish between $H_0$ and $H_1$.

**Decision-making vs. evidence assessment:**

- *Fisher:* Emphasizes measuring evidence against $H_0$ without necessarily making a final decision.
- *Neyman-Pearson:* Emphasizes making a decision between $H_0$ and $H_1$, incorporating the long-run frequencies of errors.
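The phrase "long-run frequencies of errors" can be checked with a small simulation. The sketch below repeats a hypothetical experiment many times under each hypothesis and counts how often a fixed decision rule makes the wrong call. The function name `simulate_error_rates`, the parameter values, and the shortcut of drawing sample means directly from their sampling distribution are all assumptions made for brevity.

```python
from math import sqrt
from statistics import NormalDist


def simulate_error_rates(mu0, mu1, sigma, n, alpha, trials=20000, seed=0):
    """Estimate Type I and Type II error frequencies of a one-sided z-test."""
    se = sigma / sqrt(n)
    critical = NormalDist(mu0, se).inv_cdf(1 - alpha)
    # Draw sample means directly from their sampling distribution under each hypothesis.
    means_under_h0 = NormalDist(mu0, se).samples(trials, seed=seed)
    means_under_h1 = NormalDist(mu1, se).samples(trials, seed=seed + 1)
    # Type I error: H0 is true, yet the rule accepts H1.
    type1 = sum(m > critical for m in means_under_h0) / trials
    # Type II error: H1 is true, yet the rule accepts H0.
    type2 = sum(m <= critical for m in means_under_h1) / trials
    return type1, type2


type1, type2 = simulate_error_rates(mu0=0.0, mu1=0.3, sigma=1.0, n=100, alpha=0.05)
print(f"Type I rate ~ {type1:.3f} (target alpha = 0.05)")
print(f"Type II rate ~ {type2:.3f}")
```

No single run says whether its decision was right. What the procedure controls is how often the rule errs across many repetitions.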

**Fisher's Null Hypothesis:** A unique, specific hypothesis tested to see if there is significant evidence against it, using p-values as a measure of evidence.

**Neyman-Pearson's Null Hypothesis:** One of two explicitly defined hypotheses in a decision-making framework, where tests are designed to control error rates and decide between $H_0$ and $H_1$.

**References:**

Fisher, R.A. (1925). *Statistical Methods for Research Workers.*

Neyman, J., & Pearson, E.S. (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses." *Philosophical Transactions of the Royal Society A*, 231(694-706), 289-337.
