Yet its validity and usefulness are often questioned. It's called too conservative by some [1], and too permissive by others. It's deemed arbitrary (absolutely true), but that's a good thing! I'm a proponent of 95% confidence intervals and recommend them as a solid default.

There's a reason it's been the standard from the very start of modern statistics, almost 100 years ago. And it's even more important now in the era of online experimentation. I'll share why you should make 95% your default, and when and how to adjust it.

This is a common term in experimentation, but like p-values, it's not intuitive. Even Ivy League stats professors can get it wrong [2]. By the book, a 95% confidence interval is a numerical range that, upon repeated sampling, will contain the true value 95% of the time. In practice it serves as:

A range of plausible values

A measure of precision

An indicator of how repeatable/stable our experimental method is

(These are technically incorrect interpretations, but I'll defer to those more educated on this topic [3]).
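
The by-the-book definition can be checked with a small simulation. The sketch below is only an illustration: it assumes normally distributed data, a normal-approximation interval for the mean, and made-up values for `true_mean`, `sd`, `n`, and `trials`. It repeats the same experiment many times and counts how often the interval contains the true value.

```python
import random
from statistics import NormalDist, mean, stdev

def ci_95(sample):
    """Normal-approximation 95% confidence interval for the mean of a sample."""
    z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5
    return m - z * se, m + z * se

def coverage(true_mean=10.0, sd=2.0, n=200, trials=2000, seed=0):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        lo, hi = ci_95(sample)
        hits += lo <= true_mean <= hi
    return hits / trials

if __name__ == "__main__":
    # Expect a value close to 0.95; it will wobble a little with the seed.
    print(f"coverage over repeated samples: {coverage():.3f}")
```

Note that the 95% describes the procedure across repetitions, not any single interval, which is why the shorthand interpretations above are technically incorrect.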

Perhaps the most valuable and correct use of a 95% confidence interval is as a cutoff for rejecting the null hypothesis. This is also known as a 5% significance level (100% - 95% = 5%). Your hard-fought experiments, and oftentimes hopes and dreams, instantly become successes or failures. There is no middle ground.

Confidence intervals don't distinguish between absolutely zero effect (p=1.0) and close calls (p=0.051). Both scenarios reach the same conclusion: we fail to reject the null hypothesis and treat the result as no detectable experimental effect. The plushness of random error is rudely sliced into a yes/no evaluation.
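
To make the cutoff concrete, here is a minimal sketch of a two-sided z-test, assuming the observed difference and its standard error are already known (the function name `z_test` and the inputs are hypothetical). It shows that "the 95% interval excludes zero" and "p < 0.05" are the same decision, and that p=1.0 and p=0.051 land on the same side of it.

```python
from statistics import NormalDist

def z_test(diff, se, confidence=0.95):
    """Two-sided z-test for an observed difference with standard error se.

    Returns (p_value, confidence_interval, significant).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - (1 - confidence) / 2)
    p_value = 2 * (1 - norm.cdf(abs(diff / se)))
    ci = (diff - z_crit * se, diff + z_crit * se)
    significant = not (ci[0] <= 0 <= ci[1])  # interval excludes zero
    return p_value, ci, significant

if __name__ == "__main__":
    # With se = 1 the observed difference equals the z-score.
    for diff in (0.0, 1.95, 1.97):
        p, (lo, hi), sig = z_test(diff, 1.0)
        print(f"diff={diff:.2f}  p={p:.3f}  CI=({lo:.2f}, {hi:.2f})  significant={sig}")
```

A difference of 0.00 gives p=1.0 and a difference of 1.95 gives p of roughly 0.051; neither is significant, while 1.97 just clears the bar.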

Thus if there were no 5% level firmly established, then some persons would stretch the level to 6% or 7% to prove their point. Soon others would be stretching to 10% and 15% and the jargon would become meaningless.

Irwin D. J. Bross

It's this callous nature that makes 95% confidence intervals so useful. It's a strict gatekeeper that passes statistical signal while filtering out a lot of noise. It dampens false positives in a very measured and unbiased manner. It protects us against experiment owners who are biased judges of their own work. Even with a hard cutoff, scientific authors comically resort to creative language to color borderline results and make them something more.

But why 95%? It was set by the father of modern statistics himself, Sir Ronald Fisher [5]. In 1925, Fisher picked 95% because the two-sided z-score of 1.96 is almost exactly 2 standard deviations [6]. This threshold has since persisted for almost a century.
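
The relationship between a confidence level and its two-sided z-score is easy to reproduce. This short sketch uses Python's standard-library `statistics.NormalDist`; the helper name `two_sided_z` is just for illustration.

```python
from statistics import NormalDist

def two_sided_z(confidence):
    """Critical z-score that leaves (1 - confidence) split across both tails."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Probability mass within exactly 2 standard deviations of the mean.
within_2sd = NormalDist().cdf(2) - NormalDist().cdf(-2)

if __name__ == "__main__":
    for level in (0.80, 0.90, 0.95, 0.99):
        print(f"{level:.0%} -> z = {two_sided_z(level):.3f}")
    # 80% -> z = 1.282
    # 90% -> z = 1.645
    # 95% -> z = 1.960
    # 99% -> z = 2.576
    print(f"within 2 sd: {within_2sd:.4f}")  # 0.9545
```

So a round "2 standard deviations" rule corresponds to about 95.45% confidence, which is why 95% and 1.96 are treated as near-equivalents.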

But even though this is an arbitrary number, there are many reasons to use it:

**It's unbiased.** Using what others use is defensible. You've decided to play by the same rules that others play by. Attempts to change this number (e.g. to 90% or 99%) can be viewed as subjective manipulations of the experiment rules. It's like a trial lawyer defining what "beyond a reasonable doubt" actually means.

**It's a reasonably high bar.** It represents a 1 in 20 chance of finding a significant result by pure luck (with no experimental effect). In other words, when there is no real effect, it screens out 95% of would-be false positives and serves as a reasonable filter of statistical noise.

**It's a reasonably low bar.** In practice, it's an achievable benchmark for most fields of research to remain productive.

**It's ubiquitous.** It ensures we're all speaking the same language. What one team within your company calls significant is the same as what another team calls significant.

**It's practical.** It's been argued that since p=0.05 remains the convention, it must be practically useful [7]. If the bar were too low, we would have a lot of junk polluting our research. If it were too high, researchers would be frustrated. Fisher himself used the same bar throughout his career without adjusting it.

**It's an easy choice.** Fine-tuning your confidence level in a defensible and unbiased manner requires some work. In most cases, it's a better use of your time to formulate ideas and focus on running experiments.

For all the reasons above, I recommend most experimentalists default to 95%. But there are a few good reasons why you should adjust it:

**Your risk-benefit profile is unique.** You may have a low tolerance for either false positives or false negatives. For example, startup companies that have a high risk tolerance will want to minimize false negatives by selecting lower confidence levels (e.g. 80% or 90%). People working on critical systems like platform integrity, or on life-saving drugs, may want to minimize false positives and select higher confidence levels (e.g. 99%).

**You have the wrong amount of statistical power.** You've run power calculations that fail to produce a reasonable sample size estimate. In some cases, you have too few samples and can reverse-engineer your confidence level. In other cases, you may be blessed with too many samples and can afford to cut down your false positive rate (this is a big data problem!).
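
One way to read "reverse-engineer" is: fix the sample size you can actually get, the effect you hope to detect, and the power you want, then solve for the confidence level that makes those numbers work. The sketch below does this for a two-sided, two-sample z-test with equal group sizes. The function name and all inputs are hypothetical, and it ignores the negligible chance of rejecting in the wrong direction.

```python
from statistics import NormalDist

def implied_confidence(effect, sd, n_per_group, power=0.80):
    """Confidence level at which a two-sample z-test reaches the target power.

    effect: true difference in means you want to detect
    sd: per-observation standard deviation (assumed equal in both groups)
    """
    norm = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5          # standard error of the difference
    z_crit = effect / se - norm.inv_cdf(power)  # critical value that yields the power
    if z_crit <= 0:
        raise ValueError("target power is unreachable at any useful confidence level")
    alpha = 2 * (1 - norm.cdf(z_crit))
    return 1 - alpha

if __name__ == "__main__":
    for n in (1000, 1500, 2000):
        level = implied_confidence(effect=0.1, sd=1.0, n_per_group=n)
        print(f"n={n} per group -> confidence level of about {level:.1%}")
```

With few samples the implied level falls well below 95%; with plenty of samples it rises above it, which is the "blessed with too many samples" case.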

Selecting a custom confidence level trades off false positive and false negative rates. Lowering the bar by reducing your confidence level (to, say, 90%) will increase your false positive rate but decrease your false negative rate. This will pick up more real effects but also more statistical noise. Properly tuning this number means matching it to your risk profile, which requires weighing the cost of a false positive against the cost of a false negative.
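
The trade-off can be tabulated directly. This is a rough sketch for a two-sided, two-sample z-test with a fixed sample size; the effect size, standard deviation, and sample size are made-up inputs, and `error_rates` is an illustrative helper rather than any particular library's API.

```python
from statistics import NormalDist

def error_rates(confidence, effect, sd, n_per_group):
    """Approximate (false positive rate, false negative rate) at a confidence level.

    The false positive rate applies when there is no true effect; the false
    negative rate applies when the true difference in means equals `effect`.
    """
    norm = NormalDist()
    alpha = 1 - confidence
    z_crit = norm.inv_cdf(1 - alpha / 2)
    se = sd * (2 / n_per_group) ** 0.5   # standard error of the difference
    shift = effect / se                  # expected z-score under the true effect
    power = (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)
    return alpha, 1 - power

if __name__ == "__main__":
    for level in (0.80, 0.90, 0.95, 0.99):
        fp, fn = error_rates(level, effect=0.1, sd=1.0, n_per_group=1000)
        print(f"{level:.0%} confidence: false positives {fp:.1%}, false negatives {fn:.1%}")
```

Reading down the output, the false positive rate falls as the confidence level rises while the false negative rate climbs; which row is right for you depends on the relative cost of each mistake.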

If you choose to venture down this path, I have some guidelines:

Set your confidence threshold BEFORE any data is collected. Cheaters change the confidence level after there's an opportunity to peek.

Try to reuse your custom confidence level. It's tedious and potentially biased to choose one on an experiment-by-experiment basis. It's much more useful to identify a broad set of situations and experiments where the new level should apply.

Most people, especially experimentation beginners, should stick with 95% confidence intervals. It's a really good default that applies to a lot of situations and doesn't invite extra questioning. But if you insist on changing it, make sure it matches your situation and risk profile, and do this before you start the experiment.

[1] Gelman, Andrew (Nov. 5, 2016). "Why I prefer 50% rather than 95% intervals".

[2] Gelman, Andrew (Dec. 28, 2017). "Stupid-ass statisticians don't know what a goddam confidence interval is".

[3] Morey, R.D., Hoekstra, R., Rouder, J.N. *et al.* The fallacy of placing confidence in confidence intervals. *Psychon Bull Rev* **23**, 103–123 (2016).

[4] Otte, W.M., et al. Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. *PLOS Biology* 20(2) (2022).

[5] Fisher, Ronald (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd. p. 46. ISBN 978-0-05-002170-5.

[6] Bross, Irwin D.J. (1971). "Critical Levels, Statistical Language and Scientific Inference," in Godambe VP and Sprott (eds) Foundations of Statistical Inference. Toronto: Holt, Rinehart & Winston of Canada, Ltd.

[7] Cowles, M., & Davis, C. (1982). On the origins of the .05 level of statistical significance. *American Psychologist, 37*(5), 553–558.

[8] Simon, Steve (May 6, 2002). "Why 95% confidence limits?".
