Consequently, some users may never actually be exposed to the new version of the page. I then asked the interviewees, “Could this raise any issues in the A/B testing process?”
This question highlights a common issue in A/B testing where the allocation point does not align with the exposure point. Simply put, the allocation point (often referred to as the “feature gate”) is when a user is assigned to either the control group or a test group. In contrast, the exposure point is when the user actually interacts with the version of the product to which they were allocated. Ideally, these two points should align—once a user is assigned to a group, they should immediately experience the corresponding version.
However, as demonstrated in the interview scenario, this is not always the case. Discrepancies between allocation and exposure points can arise in other contexts as well. For example, in email campaigns where we A/B test content, we often lack visibility on who opened the email and who did not. Consequently, not everyone assigned to the test group may actually be exposed to the test version.
The goal of this blog is to explore situations where the allocation point differs from the exposure point. We will examine when these cases occur, why they matter, and how we can address them. In other words, if you're curious about how you should have responded to my interview question, this blog is for you!
There are two primary reasons why the allocation point may differ from the exposure point. The first reason is technical. In some situations, as previously described, capturing the exact moment of exposure can be challenging. For instance, while tagging a page load event is relatively straightforward, tagging an event like scrolling down a page requires additional resources and effort. As a result, during test design, only a page load event might be tagged. Consequently, analysts will have data on allocation but will lack information on whether users actually experienced the intended version.
The second reason is more fundamental. Discrepancies between allocation and exposure points can arise when there is a rationale for using offline allocation. This occurs when group assignments are made prior to the experiment.
One advantage of offline allocation is that it allows analysts to control for user characteristics in the control and test groups, such as ensuring equal proportions of females or highly active users in each. However, a key drawback is that some users may be assigned to a group but never engage with the product during the test phase. This gap between assigned groups and actual user interaction can result in a misalignment between group allocation and exposure to the product.
Now that we understand why the allocation point doesn’t always align with the exposure point, you might wonder: why does this matter? If you’re aiming to identify the true impact of your new version—especially when a real effect exists—this discrepancy is crucial: misalignment between allocation and exposure points can reduce your ability to detect the effect of the new version, potentially obscuring valuable improvements to your product.
Let’s delve a bit deeper into the mechanics of this effect. To start, here’s a brief refresher on some foundational statistics for A/B testing. In hypothesis testing, the core method behind A/B testing, we consider two competing hypotheses: first, that there is no difference between the two versions (the "null hypothesis") and second, that there is a difference (the "alternative hypothesis").
This statistical approach is conservative: we begin by assuming the null hypothesis is true and only reject it if our test provides strong evidence to the contrary, thereby concluding that a difference exists between the groups.
A key element of an A/B test is its statistical power—the probability of correctly rejecting the null hypothesis when a real effect exists. High power ensures that the test is sensitive enough to detect meaningful differences. When there's a discrepancy between the allocation point and exposure point, however, the power of the test can be weakened.
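To ground these definitions, here is a minimal sketch in Python (using scipy and statsmodels; the group means, sample sizes, and effect size below are illustrative assumptions, not numbers from the interview scenario):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(42)

# Simulate one A/B test: control vs. test with a small true lift (Cohen's d = 0.2)
control = rng.normal(loc=0.0, scale=1.0, size=500)
test = rng.normal(loc=0.2, scale=1.0, size=500)

# Two-sample t-test: reject the null hypothesis if p < 0.05
t_stat, p_value = stats.ttest_ind(control, test)
print(f"p-value: {p_value:.4f} -> "
      f"{'reject' if p_value < 0.05 else 'fail to reject'} the null")

# Power: probability of detecting an effect of this size
# with 500 users per group at alpha = 0.05
power = TTestIndPower().power(effect_size=0.2, nobs1=500, alpha=0.05)
print(f"power: {power:.2f}")
```

The power call answers the question: if the true effect really is this big, how often would a test of this size detect it?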
Why is this? Intuitively, if some users are allocated to the test group but aren’t actually exposed to the test version, they effectively behave like control group users. This introduces "control-like" users into the test group, which dilutes any potential impact of the test version, making it harder to detect a true effect if it exists.
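For a rough sense of the arithmetic (with made-up numbers): if the new version truly lifts a metric by 2% but 30% of the test group is never exposed, the average lift measured across the whole test group is only about 0.7 × 2% = 1.4%, so a test sized to detect a 2% effect can easily miss it.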
If intuition alone isn’t enough, let’s clarify this with a simulation. Remember, users who are allocated to the test group but are not exposed to the test are effectively acting as control users. In this simulation, we’ll vary the proportion of these "control-like" users in the test group from 5% to 95%. We’ll simulate a scenario where a true difference exists between the groups, with an initial power of 0.9 to detect this difference (assuming all users in the test group are exposed to the effect).
For each proportion of "control-like" users, we’ll repeat the test 1,000 times and calculate the proportion of times the test correctly rejects the null hypothesis. This gives us an estimate of the test’s power as a function of the proportion of control-like users.
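A minimal version of this simulation might look like the following sketch (Python with scipy and statsmodels; the standardized effect size of 0.3 and the unit-variance normal outcomes are illustrative assumptions, not values from a real experiment):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)

EFFECT_SIZE = 0.3      # assumed true standardized difference (Cohen's d)
ALPHA = 0.05
N_SIMULATIONS = 1_000

# Sample size per group that gives 0.9 power when every test user is exposed
n_per_group = int(np.ceil(
    TTestIndPower().solve_power(effect_size=EFFECT_SIZE, alpha=ALPHA, power=0.9)
))

for dilution in np.arange(0.05, 1.0, 0.05):  # share of "control-like" users
    rejections = 0
    for _ in range(N_SIMULATIONS):
        control = rng.normal(0.0, 1.0, n_per_group)
        # Unexposed test users behave like control (mean 0);
        # exposed test users show the true effect (mean = EFFECT_SIZE)
        n_unexposed = int(round(dilution * n_per_group))
        test = np.concatenate([
            rng.normal(0.0, 1.0, n_unexposed),
            rng.normal(EFFECT_SIZE, 1.0, n_per_group - n_unexposed),
        ])
        _, p_value = stats.ttest_ind(control, test)
        rejections += p_value < ALPHA
    print(f"{dilution:.0%} control-like users -> "
          f"estimated power {rejections / N_SIMULATIONS:.2f}")
```

The graph below summarizes the simulation results: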
As shown in the graph, when there are no "control-like" users in the test group, we achieve the desired power level of 0.9. However, as we increase the proportion of control-like users—that is, users who are allocated to the test group but not actually exposed to the test—the power decreases sharply. This demonstrates the significant impact that unexposed users can have on the test’s ability to detect a true effect.
Let’s start with some good news: as long as the non-relevant population is randomly distributed between the control and test groups, the results of the t-test remain valid. This means that if you’ve already run the test, differences in allocation or exposure points might affect your ability to detect a true effect, but they won’t increase the risk of a Type I error (i.e., incorrectly rejecting the null hypothesis, or alpha). Therefore, if you already have results, you can proceed with your analysis.
As shown in the graph, if the proportion of non-relevant users in the test group is small, you’ll still maintain sufficient statistical power to detect true effects. Another post-hoc option, particularly useful when your results are close to significance, is to filter out non-relevant users. This approach is only feasible if you can identify some indication of the exposure point and retrospectively filter users accordingly.
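If such an exposure signal exists, the filtering itself is straightforward. A hypothetical sketch (assuming a table with one row per allocated user and made-up column names such as an `exposed` flag and a `metric` outcome) might look like this:

```python
import pandas as pd
from scipy import stats

# Hypothetical post-hoc filtering sketch. Assumes the 'exposed' flag was
# derived from whatever exposure signal is available (e.g., an email-open
# or page-view event); all column names here are illustrative.
df = pd.read_csv("experiment_results.csv")  # columns: user_id, group, exposed, metric

# Keep only users for whom we have evidence of exposure, in both groups
exposed = df[df["exposed"]]

control = exposed.loc[exposed["group"] == "control", "metric"]
test = exposed.loc[exposed["group"] == "test", "metric"]

t_stat, p_value = stats.ttest_ind(control, test)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```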
However, the best solution to discrepancies between allocation and exposure points is proactive preparation. If the discrepancy arises from a technical limitation (e.g., difficulty in properly tagging the exposure event), it’s important to invest time and effort in ensuring accurate tagging.
If this is not possible, or if the discrepancy is due to offline allocation, the analyst should attempt to estimate the proportion of non-relevant data. For instance, in the case of offline allocation, historical data can provide insights into how many participants were allocated to the test group but did not ultimately participate.
This proportion should be factored into the test planning process, either by increasing the statistical power or by reducing the hypothesized effect size. While this may lead to an increase in sample size (which has its own drawbacks), it ensures that the power remains sufficiently high for detecting true effects.
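One way to sketch this adjustment in Python with statsmodels (the hypothesized effect size and the 30% expected unexposed share are assumed values; in practice the share would come from historical data as described above):

```python
from statsmodels.stats.power import TTestIndPower

ALPHA = 0.05
TARGET_POWER = 0.9
HYPOTHESIZED_EFFECT = 0.3   # standardized effect among exposed users (assumed)
UNEXPOSED_SHARE = 0.3       # estimated share of test users who won't be exposed

solver = TTestIndPower()

# Sample size if every allocated test user were actually exposed
n_naive = solver.solve_power(effect_size=HYPOTHESIZED_EFFECT,
                             alpha=ALPHA, power=TARGET_POWER)

# Unexposed users dilute the average effect in the test group,
# so plan for the diluted effect instead
diluted_effect = HYPOTHESIZED_EFFECT * (1 - UNEXPOSED_SHARE)
n_adjusted = solver.solve_power(effect_size=diluted_effect,
                                alpha=ALPHA, power=TARGET_POWER)

print(f"per-group sample size, no dilution:   {n_naive:.0f}")
print(f"per-group sample size, 30% unexposed: {n_adjusted:.0f}")
```

Planning for the diluted effect roughly doubles the required sample size in this example—exactly the trade-off mentioned above—but it keeps the real-world power at the level you intended.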
Discrepancies between the allocation point and exposure point often go unnoticed. The goal of this article is to highlight this potential issue and examine its impact on the ability to detect effects in A/B testing. While there are some post-hoc solutions, the best approach is to prevent the issue from arising in the first place—primarily by ensuring that the exposure point is accurately tagged during the experiment. If this is not feasible, compensating for the resulting loss of power during the planning phase can also be an effective strategy.