Autotune is Statsig’s implementation of a Multi-Armed Bandit (MAB). MAB is a well-known probability problem that involves balancing exploration vs exploitation (Ref. 2). It’s based on a scenario where a gambler plays several slot machines (aka one-armed bandits) with different and unknown payout odds. The gambler needs a strategy that maximizes winnings by weighing the information they have and deciding whether to play the “best” machine (exploitation) or gather more information from another machine (exploration).
Similar scenarios exist in the online world, typically where some resource (money, users, or time) must be conserved and some payout must be maximized. Examples include:
MABs have also found widespread adoption in automated settings, such as determining the best artwork to display for every Netflix show (Ref. 3).
MABs and A/B Testing are the two most common types of online (digital) testing. There are a few technical differences.
Because of these differences, MABs work especially well in the following scenarios (Ref. 4):
Paradoxically, bandits work well in both high-risk and low-risk situations. MABs maximize payoffs in high-risk situations, while automating decisions in low-risk situations.
Statsig’s website (www.statsig.com) showcases Statsig’s products and offerings. But because each customer has unique needs, we encourage people to reach out and ask for a live demo. This is important enough to become the website’s primary call-to-action. Internally, we’ve debated the specific wording of the button, but as a hyper-focused startup in build-mode, optimizing our website hasn’t been our highest priority. This is a great situation for using Autotune!
To set up the test, we used the Statsig Console to create an Autotune experiment, provided the 4 variations we wanted to test, and specified the success event (button click). We provide a few parameters to play with, but for most use cases you can use the defaults, as we did:
Adding Autotune to our website relies on two key lines of code:
A quick word about our SDK. Statsig’s SDKs are designed for 100% availability and zero latency. If Statsig’s services go down, your app will not. We wrote about how we accomplish this in our blog post “Designing for Failure”.
With each statsig.getConfig() request, Autotune needs to decide which variation to deliver. While there is randomization at work, we minimize scrambling so that users receive a consistent experience upon reload or a return visit. You can expect variations to be consistent within the hour, and largely stable across several hours.
We have implemented a Bayesian Thompson Sampling algorithm. We did consider another popular choice, UCB-1, but most online comparisons slightly favor Thompson sampling (Ref. 5, 6) and its behavior is nicely differentiated from our other major testing tool, A/B testing.
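To make the idea concrete, here is a minimal sketch of Thompson sampling for click/no-click outcomes. It is not Statsig’s implementation: the Beta(1, 1) prior, the function name `thompson_pick`, the variant names, and the counts are all assumptions chosen for illustration.

```python
import random

def thompson_pick(stats, rng=random):
    """Pick one variant by drawing once from each variant's Beta posterior.

    stats maps variant name -> (successes, impressions).
    A uniform Beta(1, 1) prior is assumed for every variant.
    """
    best_name, best_draw = None, -1.0
    for name, (successes, impressions) in stats.items():
        failures = impressions - successes
        # Posterior for a Bernoulli rate with a Beta(1, 1) prior.
        draw = rng.betavariate(1 + successes, 1 + failures)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name

# Hypothetical counts, not the data from the experiment described here.
stats = {
    "variant_a": (30, 10_000),
    "variant_b": (46, 10_000),
    "variant_c": (36, 10_000),
    "variant_d": (33, 10_000),
}

rng = random.Random(0)  # seeded so the sketch is repeatable
picks = [thompson_pick(stats, rng) for _ in range(5_000)]
share = {name: picks.count(name) / len(picks) for name in stats}
print(share)
```

Variants with higher observed rates win the draw more often, so they receive more traffic, while the weaker variants still get occasional exposure because their posteriors overlap with the leader’s.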
We chose to implement a learning phase. One common assumption of MABs is that each sample is identical. However, we found that even simple click-through rates can vary throughout a day (and throughout a week). Enforcing a learning phase that evenly splits traffic for at least a day helps build a robust starting point before allocation is adjusted.
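One way to picture the learning phase is as a gate in front of the adaptive allocation. The sketch below is illustrative only: the one-day constant, the function name `traffic_split`, and the example weights are assumptions, and the adaptive weights are simply passed in rather than computed.

```python
from datetime import datetime, timedelta

LEARNING_PHASE = timedelta(days=1)  # assumed minimum duration

def traffic_split(variants, started_at, now, adaptive_weights):
    """Return the share of traffic each variant should receive.

    During the learning phase traffic is split evenly; afterwards the
    (normalized) adaptive weights from the bandit are used.
    """
    if now - started_at < LEARNING_PHASE:
        even = 1 / len(variants)
        return {v: even for v in variants}
    total = sum(adaptive_weights[v] for v in variants)
    return {v: adaptive_weights[v] / total for v in variants}

# Hypothetical inputs.
variants = ["a", "b", "c", "d"]
weights = {"a": 0.1, "b": 0.6, "c": 0.2, "d": 0.1}
start = datetime(2022, 1, 1)

early = traffic_split(variants, start, start + timedelta(hours=6), weights)
later = traffic_split(variants, start, start + timedelta(days=2), weights)
print(early)
print(later)
```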
We’ve also implemented an attribution window that catches delayed success events, which may occur several steps or hours after the impression event (e.g., a return visit). This allows Autotune to support many of the specialized scenarios requested by our customers.
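As a rough illustration of how an attribution window can work, the sketch below credits each success event to the user’s most recent impression, provided the success falls within the window. The data layout, the function name `attribute_successes`, the 24-hour window, and the last-impression rule are assumptions for this example, not a description of Autotune’s internals.

```python
from datetime import datetime, timedelta

def attribute_successes(impressions, successes, window):
    """Count successes per variant.

    impressions: list of (user_id, variant, seen_at)
    successes:   list of (user_id, happened_at)
    A success is credited to the user's latest impression that occurred
    at or before the success and no more than `window` earlier.
    """
    by_user = {}
    for user_id, variant, seen_at in impressions:
        by_user.setdefault(user_id, []).append((seen_at, variant))

    credited = {}
    for user_id, happened_at in successes:
        candidates = [
            (seen_at, variant)
            for seen_at, variant in by_user.get(user_id, [])
            if timedelta(0) <= happened_at - seen_at <= window
        ]
        if candidates:
            _, variant = max(candidates)  # most recent qualifying impression
            credited[variant] = credited.get(variant, 0) + 1
    return credited

# Hypothetical events.
t0 = datetime(2022, 1, 1, 9, 0)
impressions = [("u1", "variant_a", t0), ("u2", "variant_b", t0)]
successes = [
    ("u1", t0 + timedelta(hours=3)),   # return visit inside the window
    ("u2", t0 + timedelta(hours=30)),  # too late to be credited
]
credited = attribute_successes(impressions, successes, timedelta(hours=24))
print(credited)
```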
The Autotune experiment completed in 55 days and was able to identify a winner after 109k impressions even at exceptionally low conversion rates (444 clicks). As a whole, 58% of impressions received the winning variant, much higher than the 25% we would get in an A/B/C/D test. Autotune maximized exposure to the best button during the test.
We provide several charts, including a time series showing the probability of each variant being the best. It wasn’t a straight line for “Get a Live Demo” to win.
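A “probability of being the best” series like this can be approximated by Monte Carlo sampling from each variant’s posterior at successive snapshots. The sketch below assumes Beta(1, 1) priors and uses made-up cumulative counts; `prob_best` is a hypothetical helper, not a Statsig API.

```python
import random

def prob_best(stats, draws=4_000, rng=random):
    """Estimate P(variant has the highest true rate) by posterior sampling.

    stats maps variant name -> (successes, impressions), Beta(1, 1) prior assumed.
    """
    wins = dict.fromkeys(stats, 0)
    for _ in range(draws):
        samples = {
            name: rng.betavariate(1 + s, 1 + n - s)
            for name, (s, n) in stats.items()
        }
        wins[max(samples, key=samples.get)] += 1
    return {name: wins[name] / draws for name in stats}

# Hypothetical cumulative (successes, impressions) at three points in time.
snapshots = [
    {"variant_a": (3, 1_000), "variant_b": (4, 1_000)},
    {"variant_a": (20, 6_000), "variant_b": (27, 6_000)},
    {"variant_a": (60, 20_000), "variant_b": (110, 20_000)},
]

rng = random.Random(1)  # seeded so the sketch is repeatable
series = [prob_best(snapshot, rng=rng) for snapshot in snapshots]
for point in series:
    print(point)
```

Early snapshots carry little data, so the estimates can swing between variants before settling, which is consistent with a winner’s path not being a straight line.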
Autotune selected “Get a Live Demo” for our website (0.46% success rate), which was 53% better than our existing choice and 28% better than the second-best option. The test required 55 days, but involved no decision-making overhead while diverting 58% of traffic to the best option.
If this had been an A/B/n test, we would have been able to conclude that the winning variation was better than control (p = 0.01, a statistically significant result even with a Bonferroni correction), and that the winning variation was statistically the best (p < 0.03). While the outcome would be the same, MAB delivered two advantages:
Statsig offers easy-to-use and analytically “smart” product development tools. Want to try Autotune? Sign up and try it at Statsig.com. Statsig offers free Developer accounts that come with a generous 5M events a month.