MAB is a well-known probability problem that involves balancing exploration vs exploitation (Ref. 2). It’s based on a scenario where a gambler plays several slot machines (aka one-armed bandits) with different and unknown payout odds. The gambler needs a strategy that maximizes winnings by weighing the information they have and deciding whether to play the “best” machine (exploitation) or gather more information from another machine (exploration).
Similar scenarios exist in the online world, typically where some resource (money, users, or time) must be conserved and some payout must be maximized. Examples include:
Determining which product(s) to feature on a one-day Black Friday sale (resource = time, payout = revenue).
Showing the best performing ad given a limited budget (resource = budget, payout = clicks/visits).
Selecting the best signup flow given a finite number of new users (resource = new users, payout = signups).
MAB has also found widespread adoption in automated settings, such as determining the best artwork to display for each Netflix show (Ref. 3).
MABs and A/B Testing are the two most common types of online (digital) testing. There are a few technical differences.
Because of these differences, MABs work especially well in the following scenarios (Ref. 4):
Maximizing Gain: When resources are scarce and maximizing payoff is critical.
Multiple Variations: Bandits are good at focusing traffic on the most promising variations. Bandits can be especially useful compared with traditional A/B testing when there are more than 4 variations.
Well-understood, simple and well-behaved key metric: Bandits work well when there is a single key metric that is a reliable measure of the change being tested. This metric should be well-understood (e.g. higher is always better) and produce no worrying downstream interactions or unintended effects. The metric should be stable and immune to temporal variability.
Automation is important: Bandits help when you want to launch dozens or hundreds of automated tests and/or avoid the decision-making overhead of an A/B test. Automation is also critical when you have no estimate of the expected effect size and cannot estimate the test duration.
Paradoxically, Bandits work great in both high-risk and low-risk situations. MABs maximize payoffs in high-risk situations, while automating decisions for low-risk situations.
Statsig’s website (www.statsig.com) showcases Statsig’s products and offerings. But because each customer has unique needs, we encourage people to reach out and ask for a live demo. This is important enough to become the website’s primary call-to-action. Internally, we’ve debated the specific wording of the button, but as a hyper-focused startup in build-mode, optimizing our website hasn’t been our highest priority. This is a great situation for using Autotune!
To set up the test, we used the Statsig Console to create an Autotune experiment, provided the 4 variations we wanted to test, and specified the success event (button click). We provide a few parameters to play with, but for most use cases you can use the defaults like we did:
exploration window (default = 24 hrs): The initial period during which Autotune splits traffic evenly. Afterwards, Autotune uses a probabilistic algorithm to bias traffic towards the winner.
attribution window (default = 1 hr): The maximum time between an attempt (e.g. button impression) and a success event (e.g. click) that will count towards Autotune. Adjusting this window can help capture direct effects or eliminate background noise.
winner threshold (default = 95%): The confidence level at which Autotune declares a winner and begins diverting 100% of traffic towards it.
Adding Autotune to our website relies on two key lines of code:
statsig.getConfig('demo_button_text_actual_test'): Fetches an assigned text value for each user. Statsig and Autotune handle user assignment. This call also triggers an exposure, which lets Statsig know the user is about to see the button.
statsig.logEvent('click'): Logs a successful click. Combined with getConfig(), this allows Autotune to compute the click-through rate.
A quick word about our SDK. Statsig’s SDKs are designed for 100% availability and zero latency. If Statsig’s services go down, your app will not. We wrote about how we accomplish this in our blog post “Designing for Failure”.
With each statsig.getConfig() request, Autotune needs to decide which variation to deliver. While there is randomization at work, we minimize scrambling so that users receive a consistent experience upon reload or a return visit. In general, you can expect that variations are consistent within the hour, and generally robust across several hours.
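One way to get this kind of stickiness (an assumption for illustration, not a description of Statsig’s internals) is to hash each user ID to a fixed point in [0, 1) and map that point onto the cumulative traffic weights. A user’s variation then changes only when a weight boundary moves past their point. The salt, function names, and weights below are hypothetical.

```python
import hashlib


def user_point(user_id, salt="demo_button_text"):
    """Map a user ID to a stable pseudo-random point in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def assign(user_id, weights):
    """Return the index of the variation whose cumulative weight range contains the user's point."""
    point = user_point(user_id)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if point < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point rounding


# Hypothetical weights before and after a small reallocation.
before = assign("user-42", [0.25, 0.25, 0.25, 0.25])
after = assign("user-42", [0.24, 0.26, 0.25, 0.25])
```

With this scheme, a small shift in weights reassigns only the users whose points sit near a moved boundary; everyone else keeps the variation they had.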
We have implemented a Bayesian Thompson Sampling algorithm. We did consider another popular choice, UCB-1, but most online comparisons slightly favor Thompson sampling (Ref. 5, 6) and its behavior is nicely differentiated from our other major testing tool, A/B testing.
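For readers unfamiliar with Thompson Sampling, here is a minimal sketch for click/no-click outcomes. It is not Statsig’s implementation: the uniform Beta(1, 1) prior, the function names, and the simulated true click rates are all assumptions made for illustration. Each round draws one sample from every variation’s Beta posterior and serves the variation with the highest draw.

```python
import random


def thompson_pick(successes, failures, rng):
    """Sample each arm's Beta posterior and return the index of the largest draw."""
    samples = [
        rng.betavariate(1 + s, 1 + f)  # Beta(1, 1) uniform prior (assumed)
        for s, f in zip(successes, failures)
    ]
    return samples.index(max(samples))


def simulate(true_rates, rounds, seed=0):
    """Run a simulated bandit and return how often each arm was served."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_pick(successes, failures, rng)
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Made-up click rates; the third arm is clearly the best.
pulls = simulate([0.02, 0.03, 0.10], rounds=5000)
```

In a run like this, most of the traffic ends up on the strongest arm while the weaker arms still receive occasional exploratory impressions.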
We chose to implement a learning phase. One common assumption of MABs is that samples are identically distributed over time. However, we found that even simple click-through rates can vary throughout a day (and throughout a week). Enforcing a learning phase that evenly splits traffic for at least a day helps build a robust starting point before allocation is adjusted.
We’ve also implemented an attribution window that catches delayed success events, which may occur several steps or hours after the impression event (e.g. on a return visit). This allows Autotune to support many of the specialized scenarios requested by our customers.
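The learning phase can be pictured as a simple gate in front of the allocation logic. The sketch below is an assumption about how such a gate might look, not Statsig’s code: traffic is split evenly until the exploration window has elapsed, after which each variation is weighted by its estimated probability of being best (which is the share Thompson Sampling gives each variation on average). The function name and inputs are hypothetical.

```python
from datetime import datetime, timedelta

EXPLORATION_WINDOW = timedelta(hours=24)  # default value quoted in the text


def traffic_weights(start, now, prob_best):
    """Return per-variation traffic weights that sum to 1."""
    n = len(prob_best)
    if now - start < EXPLORATION_WINDOW:
        return [1.0 / n] * n  # learning phase: even split
    total = sum(prob_best)
    return [p / total for p in prob_best]  # afterwards: favor likely winners


start = datetime(2022, 1, 1, 0, 0)
early = traffic_weights(start, start + timedelta(hours=6), [0.1, 0.6, 0.2, 0.1])
later = traffic_weights(start, start + timedelta(hours=30), [0.1, 0.6, 0.2, 0.1])
```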
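To make the attribution rule concrete, here is a small sketch that counts an impression as a success only if the same user logs a click within the window. The data layout, function name, and sample events are hypothetical, and the sketch ignores details a production system would need (for example, deciding which impression a click belongs to when a user has several).

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(hours=1)  # default value quoted in the text


def count_successes(impressions, clicks, window=ATTRIBUTION_WINDOW):
    """Count impressions followed by a click from the same user within the window.

    Both arguments are lists of (user_id, timestamp) tuples.
    """
    clicks_by_user = {}
    for user_id, clicked_at in clicks:
        clicks_by_user.setdefault(user_id, []).append(clicked_at)

    successes = 0
    for user_id, shown_at in impressions:
        user_clicks = clicks_by_user.get(user_id, [])
        if any(timedelta(0) <= c - shown_at <= window for c in user_clicks):
            successes += 1
    return successes


t0 = datetime(2022, 1, 1, 12, 0)
impressions = [("a", t0), ("b", t0), ("c", t0)]
clicks = [
    ("a", t0 + timedelta(minutes=10)),  # inside the 1-hour window
    ("b", t0 + timedelta(hours=2)),     # too late for the default window
    ("d", t0 + timedelta(minutes=5)),   # no matching impression
]
default_count = count_successes(impressions, clicks)
wide_count = count_successes(impressions, clicks, window=timedelta(hours=3))
```

Widening the window picks up the delayed click from user "b", which is the trade-off the parameter controls: a longer window captures slower conversions but also admits more unrelated activity.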
The Autotune experiment completed in 55 days and was able to identify a winner after 109k impressions, even at exceptionally low conversion rates (444 clicks). As a whole, 58% of impressions received the winning variant, much higher than the 25% we would get in an A/B/C/D test. Autotune maximized exposure to the best button during the test.
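As a quick back-of-the-envelope check on what that difference means in impressions, the arithmetic below uses the rounded figures from the text (109k impressions, 58% vs 25%), so the results are approximate.

```python
impressions = 109_000   # approximate total from the text
autotune_share = 0.58   # share of impressions that saw the winning variant
even_share = 1 / 4      # share under an even A/B/C/D split

winner_impressions_autotune = round(impressions * autotune_share)
winner_impressions_even = round(impressions * even_share)
extra_winner_impressions = winner_impressions_autotune - winner_impressions_even
```

On these rounded inputs, the winning button was shown roughly 63,000 times instead of roughly 27,000, or about 36,000 additional impressions of the best variant during the test.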
We provide several charts, including a time series showing the probability of each variant being the best. It wasn’t a straight line for “Get a Live Demo” to win.
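A probability-of-being-best figure like this can be estimated by Monte Carlo: draw a conversion rate for every variant from its posterior many times and count how often each variant comes out on top. The sketch below assumes Beta posteriors with a uniform prior and uses made-up counts, not the data from this test. Comparing the top probability against the 95% winner threshold is likewise an assumption about how such a threshold could be applied.

```python
import random


def probability_best(successes, trials, draws=20000, seed=0):
    """Estimate, for each variant, the probability that it has the highest true rate."""
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        samples = [
            rng.betavariate(1 + s, 1 + (n - s))  # Beta(1, 1) uniform prior (assumed)
            for s, n in zip(successes, trials)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical counts for four variants.
clicks = [30, 45, 33, 36]
impressions = [10_000, 10_000, 10_000, 10_000]

probs = probability_best(clicks, impressions)
winner_threshold = 0.95
has_winner = max(probs) >= winner_threshold
```

Recomputing these probabilities as data accumulates and plotting them over time would produce the kind of chart described above.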
Autotune selected “Get a Live Demo” for our website (0.46% success rate), which was 53% better than our existing choice and 28% better than the second-best option. The test required 55 days, but involved no decision-making overhead while diverting 58% of traffic to the best option.
If this had been an A/B/n test, we would have been able to conclude that the winning variation was better than control (p = 0.01, a statistically significant result even with a Bonferroni correction), and that the winning variation was statistically the best (p < 0.03). While the outcome would be the same, MAB delivered two advantages:
Under an A/B/C/D test, 75% of the traffic would have been diverted to inferior variations (vs 42% for Autotune).
We didn’t have an initial estimate of the click-through rate increase, making it impossible to run a power analysis and estimate how long the test would have taken. Instead of continually peeking at the results, Autotune automated the decision-making process.
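To see why the missing effect-size estimate matters, consider the standard normal-approximation formula for the sample size needed to compare two proportions. The sketch below is illustrative only: the 5% significance level and 80% power are conventional assumptions, the baseline of about 0.30% is back-calculated from the text (0.46% being 53% better than the existing button), and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


baseline = 0.0046 / 1.53  # about 0.30%, derived from the figures in the text

# Required impressions per arm for a few assumed relative lifts.
sizes = {lift: sample_size_per_arm(baseline, lift) for lift in (0.10, 0.25, 0.53)}
```

Because the required sample size scales with the inverse square of the difference in rates, guessing a 10% lift instead of a 53% lift changes the planned test duration by more than an order of magnitude. Without a credible guess, there is no credible plan.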
Statsig offers easy-to-use and analytically “smart” product development tools. Want to try Autotune? Sign up and try it at Statsig.com. Statsig offers free Developer accounts that come with a generous 5M events a month.
Wikipedia, “Multi-armed Bandit”
Ashok Chandrashekar, Fernando Amat, Justin Basilico and Tony Jebara, The Netflix Technology Blog. “Artwork Personalization at Netflix”
Alex Birkett, “When to Run Bandit Tests Instead of A/B/n Tests”
Brian Amadio, “Multi-Armed Bandits and the Stitch Fix Experimentation Platform”
Steve Robert, “Multi-Armed Bandits: Part 6 — A Comparison of Bandit Algorithms”
Thanks to Naser Tamimi on Unsplash for the cover image!