How to report test results

Thu Jan 02 2025

Allon Korem

CEO, Bell Statistics

Oryah Lancry-Dayan

Lead Statistician, Bell Statistics

You’ve invested significant effort into designing and executing an A/B test: meticulously planning the sample size to achieve the desired statistical power, waiting patiently for data collection, ensuring proper allocation between control and test groups, and carefully analyzing the results. Now comes the critical moment—communicating your insights to your company’s stakeholders.

Unfortunately, this is where many analysts falter, making fundamental mistakes that can misrepresent or undermine the test's findings. To ensure your hard work translates into meaningful impact, this blog highlights common reporting errors and offers practical guidance for presenting your results effectively.

And the best part? At the end, you’ll find a suggested outline to structure your A/B test report, ensuring it’s clear, actionable, and impactful for your audience!

1. Overstating certainty 

If you’ve ever struggled with a statistics course, you might have asked yourself: Why do we even need statistics? The answer lies in a fundamental challenge: people seek to draw conclusions about a phenomenon without access to all the relevant information. For example, an analyst may want to determine whether a new version of an app attracts more sign-ups. However, they can’t examine every potential user; they only have data on a subset of current new users. This sample is random and might not perfectly represent the true impact of the new version.

To bridge the gap between the available data and the phenomenon being studied, statisticians have developed a powerful framework over the last century. Importantly, this framework acknowledges a critical limitation: no single sample can provide absolute certainty. Instead, statistics focuses on the validity of the analysis process, including quantifying the likelihood of errors. Thus, statistical methods allow us to draw conclusions with a certain level of confidence, while accepting that uncertainty remains.

What can go wrong: A common mistake is forgetting that we can only be certain about our sample, while any statement about the population is inherently probabilistic. Analysts may prematurely generalize sample results to the population, leading to overly definitive claims such as, “This feature will increase revenue by 10%” or “The conversion rate in the new version improved by 5%.”

How to get it right: When communicating test results, it’s crucial to remember that your data reflects what happens in your sample and may not precisely represent the population. Use cautious language when describing results (e.g., “In our sample, the lift was 10%”), and apply statistical methods like confidence intervals to estimate the lift at the population level. This approach helps convey the inherent uncertainty while providing a more accurate and credible interpretation.
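As an illustration, here is a minimal Python sketch of this kind of estimate: a normal-approximation (Wald) 95% confidence interval for the difference in conversion rates between control and test. The conversion counts and group sizes are invented for the example.

```python
import math

# Hypothetical example numbers: conversions and sample sizes per group.
control_conv, control_n = 480, 10_000
test_conv, test_n = 525, 10_000

p_c = control_conv / control_n   # conversion rate in the control sample
p_t = test_conv / test_n         # conversion rate in the test sample
diff = p_t - p_c                 # observed lift (absolute difference) in the sample

# Standard error of the difference between two independent proportions.
se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / test_n)

z = 1.96  # critical value for a 95% confidence level
lower, upper = diff - z * se, diff + z * se

print(f"Observed lift in the sample: {diff:.2%}")
print(f"95% CI for the lift in the population: [{lower:.2%}, {upper:.2%}]")
```

The sample lift is reported as a fact about the sample, while the interval conveys the range of plausible values for the population-level lift.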

2. Confusing test settings with test results

As discussed in the previous section, statisticians acknowledge that they cannot be certain about the validity of a single dataset. However, they have developed methods to draw meaningful conclusions about the reliability of the overall analytical process. Specifically, statisticians aim to control the proportion of times the process leads to incorrect conclusions.

For instance, in hypothesis testing, the alpha (α) parameter represents the probability of incorrectly concluding that an effect exists when, in reality, it does not (a Type I error). The principle is that if the testing process were repeated many times under conditions where no true effect exists, the proportion of tests that falsely indicate an effect would not exceed the alpha level. For example, if alpha is set to 5% and the null hypothesis is rejected, we cannot be certain that our conclusion is correct. However, we do know that the probability of making a false conclusion is controlled and limited to 5%. Correspondingly, statisticians define a confidence level, equal to 1-α, which reflects the degree of confidence that the conclusion is correct.
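To make the "repeated many times" idea concrete, here is a small simulation sketch in Python: both groups are drawn from the same distribution (an A/A comparison, so the null hypothesis is true), and the share of falsely significant results stays close to the chosen alpha. The metric and group sizes are invented for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha = 0.05          # chosen in advance, before seeing any data
n_simulations = 5_000
false_positives = 0

for _ in range(n_simulations):
    # Both groups come from the same distribution: the null hypothesis is true.
    control = rng.normal(loc=10.0, scale=2.0, size=1_000)
    test = rng.normal(loc=10.0, scale=2.0, size=1_000)
    _, p_value = ttest_ind(control, test)
    if p_value < alpha:
        false_positives += 1

# Should land near 5%: alpha caps the long-run rate of false conclusions.
print(f"False positive rate: {false_positives / n_simulations:.1%}")
```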

Alpha and the confidence level are parameters of the testing process, set before any data are collected. Hypothesis testing also involves values computed from the observed data, such as the lift and the p-value. The p-value represents the probability of obtaining results at least as extreme as the observed lift, assuming the null hypothesis is true (i.e., there is no effect). In this way, the p-value quantifies how surprising or extreme the observed results are under the null hypothesis.

What can go wrong: Errors often arise when analysts conflate alpha with the p-value. A subtler form of this error occurs when the p-value is incorrectly used to describe the validity of the results. For instance, if the p-value is 2%, an analyst might incorrectly state, “We won with 98% confidence.” A more serious mistake involves using 1 minus the p-value to redefine the test's settings. In this scenario, the analyst might claim that the test's significance level is 2% or that the confidence level is 98%, based solely on the observed p-value and ignoring the predefined settings of the test.

How to get it right: The key to avoiding these errors is understanding the fundamental difference between alpha and the p-value. When discussing the validity of a test—such as its significance level or confidence level—these values should be based on the predefined test settings, which involve alpha, and remain independent of the observed results. The p-value cannot provide any insight into the reliability of the testing process itself; it reflects only the properties of the current sample. Therefore, using 1 minus the p-value to infer the validity of the results is incorrect and misleading.
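A minimal sketch of the distinction, with hypothetical numbers: alpha and the confidence level are fixed before the test and do not move with the data, while the p-value is computed from the observed sample.

```python
# Test settings, fixed before data collection.
alpha = 0.05
confidence_level = 1 - alpha          # 95%, regardless of what the data show

# Result computed from the observed sample (hypothetical value).
p_value = 0.02

is_significant = p_value < alpha

# Correct: the confidence level comes from the predefined settings...
print(f"Significant at the {confidence_level:.0%} confidence level: {is_significant}")
print(f"Observed p-value: {p_value:.0%}")
# ...whereas claiming "98% confidence" from 1 - p_value mixes up the two concepts.
```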

3. Misinterpreting p-values 

Now that we understand the concept of the p-value, we can address another common error in reporting test results: the misunderstanding of what the p-value actually represents. To clarify, the p-value is calculated based on the data and represents the probability of obtaining the observed results (or more extreme ones) under the assumption that there is no effect in the data.

What can go wrong: A common error is to refer to the p-value as a measure of error, rather than a measure of the results. This is often reflected in statements such as, "With a p-value of 2%, there is a 2% chance that the result is wrong." However, since the p-value is computed based on the sample, it cannot provide any insights into the validity of the testing process itself. In other words, you cannot use the sample to validate its own correctness. An analogy would be someone claiming they are right simply because they say so!

How to get it right: Report the p-value for what it is: a measure of how unlikely the observed results are, assuming there is no effect. You can use the p-value to gauge the strength of the evidence against the null hypothesis. A very low p-value provides strong evidence against the null hypothesis, suggesting that the observed effect is unlikely to have occurred by chance.
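For instance, here is a hand-rolled two-proportion z-test in Python (with invented counts); the p-value it produces is a statement about the observed data under the null hypothesis, nothing more.

```python
import math
from scipy.stats import norm

# Hypothetical conversions and sample sizes per group.
control_conv, control_n = 480, 10_000
test_conv, test_n = 560, 10_000

p_c, p_t = control_conv / control_n, test_conv / test_n
# Pooled rate under the null hypothesis of no difference.
p_pool = (control_conv + test_conv) / (control_n + test_n)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / test_n))

z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

# Report it as what it is: how surprising the observed lift would be if there were no effect.
print(f"Observed lift: {p_t - p_c:.2%}, p-value: {p_value:.3f}")
```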

4. Misinterpreting confidence intervals

So far, our primary focus has been on hypothesis testing. However, result reports often include a confidence interval as well. While hypothesis testing determines whether the difference between two versions is statistically significant, a confidence interval provides an estimate of the size of that difference.

The concept of a confidence interval recognizes that the value obtained from a single sample is unlikely to perfectly match the true difference between the two versions. For instance, if the observed lift in the sample is 4%, it is unlikely that the entire population of users would show exactly the same 4% lift. Instead, the true value is likely to lie close to the sample's lift, within a certain range.

Thus, to estimate the difference between the two versions, we use an interval rather than relying solely on a point estimate. Since the true difference is unknown, it is impossible to guarantee that every interval will capture it. What can be controlled, though, is the error rate—the proportion of intervals that fail to include the true difference. This error rate is commonly denoted as alpha (α), while 1-α represents the confidence level, which indicates the proportion of intervals expected to correctly capture the true value.

Importantly, the confidence level reflects the reliability of the interval construction process rather than the accuracy of any single interval. For example, a 95% confidence level means that if the process of constructing confidence intervals were repeated many times, 95% of those intervals would contain the true difference.

What can go wrong: Analysts often mistakenly attribute the confidence level to the value itself, making statements like, “There is a 95% probability that the true lift is between 2% and 4%”. This interpretation is incorrect. Once the confidence interval has been calculated, the true lift either falls within the interval (100% certainty) or it does not (0% certainty). For instance, if the true lift is 3.2%, it is definitively within the observed interval, and if it is 1.7%, it is definitively outside it. The confidence level (e.g., 95%) reflects the probability that an interval would contain the true value, not the probability that the true value is within the current interval.

How to get it right: When describing a confidence interval, attribute the confidence to the interval construction process, not to the true value being estimated. For example, you could describe the results as, "The lift is estimated to be between 2% and 4%, and we are 95% confident that this interval is valid." Always ensure you understand and can explain the correct interpretation of a confidence interval.
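A short simulation sketch makes the process-level reading concrete: if we repeatedly draw samples and build a 95% interval around each observed difference, roughly 95% of those intervals cover the true lift. The true rates and sample sizes below are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p_control, true_p_test = 0.10, 0.13   # true rates; the true lift is 3 percentage points
true_lift = true_p_test - true_p_control
n, n_simulations, z = 5_000, 2_000, 1.96
covered = 0

for _ in range(n_simulations):
    c = rng.binomial(n, true_p_control) / n
    t = rng.binomial(n, true_p_test) / n
    diff = t - c
    se = np.sqrt(c * (1 - c) / n + t * (1 - t) / n)
    if diff - z * se <= true_lift <= diff + z * se:
        covered += 1

# Close to 95%: the confidence level describes the interval-building process,
# not the probability that any single interval contains the true lift.
print(f"Coverage: {covered / n_simulations:.1%}")
```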

5. Ignoring external validity

Congratulations! You are now aware of the common misconceptions when reporting test results. The final pitfall to avoid is more about understanding the context of your test than the statistical theory behind it. Specifically, when generalizing your results from the sample to the population (a concept known as external validity), it is crucial to consider the timing of the test and the profile of users. 

What can go wrong: When reporting test results, analysts sometimes use deterministic language, making broad generalizations about the entire population without caveats. While this can occasionally be justified, the description of the results should usually be more cautious and include appropriate qualifiers to reflect the limitations of the sample and the context of the test.

How to get it right: When describing your results, be sure to mention any limitations and contextual factors. For example, consider the timing of the test—was it conducted during a representative period, or could it have been influenced by a special time of year? Additionally, address the characteristics of the users—did the observed effect apply to all types of users, or was it more pronounced for specific groups? Is the sample representative of the broader user population? Acknowledging these factors helps provide a more accurate interpretation of the results.

An example: a test results report

Understanding common errors in test results is the first step toward creating an impactful test report. The next step is to craft a clear and precise report that effectively conveys the key findings. Here are the important building blocks for your report:

  1. The bottom line: Your report should begin with the key findings of the test, including the impact on the main KPI (such as the uplift, confidence interval, and p-value). You can also highlight one or two other noteworthy results, such as significant changes in other KPIs, important segments, or any other relevant observations. Conclude this section with the decision based on the test’s results. The goal of this section is to provide a concise summary for readers who want the essential information without delving into the details of the test.

  2. Primary KPI: Present the main findings for the primary KPI, ideally with a visualization of the confidence interval for the difference between the two versions. Clearly indicate the boundaries of the confidence interval and highlight whether 0 falls within or outside the interval. If you have more than two test versions, include a confidence interval for each version compared to the control. Use color to emphasize your key points, such as green for significant results. Additionally, provide supporting details, such as the p-value, in a sidebar for further context. If your test involves sequential testing, consider including a graph that illustrates the progression of the test over time.

  3. Secondary KPIs: For secondary KPIs, summarize the results visually or in a table that includes the uplift, the boundaries of the confidence interval, and the p-value (a minimal sketch of such a table follows this list).

  4. Information about the current test: It can be helpful to include a summary of the main characteristics of the test at the beginning of the report. Beyond the test’s name, this should cover the primary hypothesis or motivation for the test, the versions included in the test, and the number of users in each group. It can also be helpful to include a screenshot or a GIF of the changes made in the current test.

  5. Insights and future directions: You can conclude your report with insights gained during the test, which may serve as valuable leads for future tests. Highlight any patterns, unexpected findings, or potential areas for further exploration that could inform the design of upcoming experiments.
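As mentioned in the secondary KPIs item above, a results table is an easy way to summarize several metrics at once. Here is a minimal pandas sketch, with invented KPI names and numbers, of how such a table might be assembled:

```python
import pandas as pd

# Hypothetical per-KPI results; in practice these would come from your analysis pipeline.
results = [
    {"KPI": "Sign-up rate",     "Uplift": "+5.0%", "CI lower": "+2.0%", "CI upper": "+8.0%", "p-value": 0.012},
    {"KPI": "Revenue per user", "Uplift": "+1.2%", "CI lower": "-0.8%", "CI upper": "+3.2%", "p-value": 0.240},
    {"KPI": "Session length",   "Uplift": "+0.4%", "CI lower": "-1.1%", "CI upper": "+1.9%", "p-value": 0.600},
]

report_table = pd.DataFrame(results)
print(report_table.to_string(index=False))
```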

Curious to see how it works in practice? Take a look at this example: Reading Experiment Results | Statsig Docs
