There are a few different routes you could take to achieve this, each with its own pros and cons:
One reasonable idea might be just to keep a giant table of all of your users and the group each of them falls into for the flag. There are actually some clear upsides here.
For analysis, you already have a nice table of test and control users to work with, and changing a given user's assignment is as simple as updating a row in that table. However, the latency implications of this method are severe: every time you want to check a gate or an experiment, you're performing a database lookup.
There’s another issue with that approach, as well. When you’re doing your analysis, you’ll be suffering from massive overexposure.
That is to say, if you have a change that only 10% of your users actually encountered, you're still computing differences between control and test across every single user you have, as if all of them had run into the change. This adds a lot of noise to your experiment and makes getting accurate results take much longer.
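As a rough back-of-the-envelope illustration (assuming unexposed users are unaffected and outcome variance is comparable across groups): if only a fraction $p$ of the users in your analysis actually saw the change, the effect you measure gets diluted, and the sample size needed to detect it blows up accordingly.

$$\delta_{\text{measured}} \approx p\,\delta, \qquad n \propto \frac{1}{\delta^{2}} \;\Rightarrow\; n_{\text{diluted}} \approx \frac{n_{\text{exposed-only}}}{p^{2}}, \quad \text{e.g. roughly } 100\times \text{ at } p = 0.1.$$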
Our approach to solving that issue is to only have a table of the users who actually saw the experiment. When you check an experiment, we log an “exposure” event to populate that table. Doing it at the time of check nicely ensures that there’s minimal room for differences in experiment behavior to cause differences in logging, helping prevent issues like Sample Ratio Mismatch from arising.
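Here's a minimal sketch of that pattern; the function and event names are illustrative, not Statsig's actual SDK API, and a simple list stands in for a real event pipeline so the example is self-contained.

```python
from datetime import datetime, timezone

# Stand-in for a real analytics/event pipeline.
exposure_log: list[dict] = []

def assign_group(experiment_name: str, user_id: str) -> str:
    # Placeholder; the deterministic, hash-based assignment is described below.
    return "control"

def check_experiment(experiment_name: str, user_id: str) -> str:
    """Return the user's group, logging an exposure at the moment of the check."""
    group = assign_group(experiment_name, user_id)
    # Logging here, rather than when the feature actually renders, means the
    # exposure table only contains users who hit this code path, and test and
    # control are logged through the exact same line of code, which guards
    # against Sample Ratio Mismatch.
    exposure_log.append({
        "experiment": experiment_name,
        "user_id": user_id,
        "group": group,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return group
```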
On the assignment piece, we make assignment completely deterministic.
The simplest version of this you could imagine is saying, "Users with an even ID get control, and users with an odd ID get test." This obviates the need for any database lookup or for keeping a gigantic list in memory.
Of course, that heuristic has an obvious pitfall: every experiment would have the same split, removing the randomization that's so critical to cogent analysis. The answer, then, is to give each experiment its own deterministic split.
We accomplish this by generating a salt for each experiment, combining that salt with the ID we're checking the experiment for, and computing a SHA-256 hash of the result. Rather than doing a simple even/odd check on that hash, we take it modulo 10,000 to allow for finer-grained control of percentages.
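A minimal sketch of that scheme in Python (the exact salt format and the way the hash is reduced to a bucket are illustrative; the core idea is that the same salt and ID always land in the same bucket):

```python
import hashlib
import secrets

NUM_BUCKETS = 10_000

def new_experiment_salt() -> str:
    """Generate a random salt once, when the experiment is created."""
    return secrets.token_hex(16)

def get_bucket(experiment_salt: str, unit_id: str) -> int:
    """Deterministically map a unit (e.g. a user ID) to one of 10,000 buckets."""
    digest = hashlib.sha256(f"{experiment_salt}.{unit_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer and reduce it to a
    # bucket in [0, 10000). No database lookup or assignment table needed.
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS
```

Because each experiment gets its own salt, two experiments at the same percentage still partition users independently of one another.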
Choosing which buckets map to which value is an interesting problem in itself. For gates, where we expect to roll the percentage of a feature up and down, we want that mapping to be as predictable as possible: with a 30% rollout, the first 30% of buckets simply correspond to passing the gate.
For experiments in a layer, on the other hand (where you might be running multiple iterations of different experiences), it's actually preferable that each time you go to 30% of users, it's a different 30% of users, so there we assign the buckets randomly.
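Continuing the sketch above, here is one way those two policies could look; again, this is illustrative rather than the production implementation.

```python
import random

NUM_BUCKETS = 10_000

def gate_passes(bucket: int, rollout_percent: float) -> bool:
    """Gates: the first N% of buckets pass, so ramping from 10% to 30%
    keeps everyone who already passed at 10% in the pass group."""
    return bucket < int(NUM_BUCKETS * rollout_percent / 100)

def layer_experiment_buckets(experiment_salt: str, allocation_percent: float) -> set[int]:
    """Layer experiments: allocate a shuffled subset of buckets.

    Seeding the shuffle with the experiment's salt keeps the selection
    deterministic for that experiment, while ensuring that a different
    experiment (or a new iteration) at 30% targets a different 30% of users.
    """
    buckets = list(range(NUM_BUCKETS))
    random.Random(experiment_salt).shuffle(buckets)
    return set(buckets[: int(NUM_BUCKETS * allocation_percent / 100)])
```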