Introducing Autotune: Statsig’s Multi-Armed Bandit

Timothy Chan
Thu Feb 03 2022
A-B-TESTING DATA-SCIENCE BAYESIAN-STATISTICS THOMPSON-SAMPLING EXPERIMENTATION
Photo by Naser Tamimi on Unsplash

Autotune is Statsig’s implementation of a Multi-Armed Bandit (MAB). MAB is a well-known probability problem that involves balancing exploration vs exploitation (Ref. 2). It’s based on a scenario where a gambler plays several slot machines (aka one-armed bandits) with different and unknown payout odds. The gambler needs a strategy that maximizes winnings by weighing the information they have and deciding whether to play the “best” machine (exploitation) or gather more information from another machine (exploration).

Similar scenarios exist in the online world, typically where some resource (money, users, or time) must be conserved and some payout must be maximized. Examples include:

  1. Determining which product(s) to feature on a one-day Black Friday sale (resource = time, payout = revenue).
  2. Showing the best performing ad given a limited budget (resource = budget, payout = clicks/visits).
  3. Selecting the best signup flow given a finite amount of new users (resource = new users, payout = signups).

MABs have also found widespread adoption in automated settings, such as determining the best artwork to display for each Netflix show (Ref. 3).

Comparison with A/B Testing

MABs and A/B Testing are the two most common types of online (digital) testing. There are a few technical differences.

Because of these differences, MABs work especially well in the following scenarios (Ref. 4):

  1. Maximizing Gain: When resources are scarce and maximizing payoff is critical.
  2. Multiple Variations: Bandits are good at focusing traffic on the most promising variations, and can be especially useful compared with traditional A/B testing when there are >4 variations.
  3. Well-understood, simple and well-behaved key metric: Bandits work well when there is a single key metric that is a reliable measure of the change being tested. This metric should be well-understood (eg. higher is always better) and produce no worrying downstream interactions or unintended effects. The metric should be stable and immune to temporal variability.
  4. Automation is important: This is important when you want to launch dozens or hundreds of automated tests and/or avoid the decision-making overhead of an A/B test. It’s also critical when you have no estimate of the expected effect size and cannot estimate the test duration.

Paradoxically, bandits work well in both high-risk and low-risk situations: MABs maximize payoffs in high-risk situations, while automating decisions in low-risk ones.

Case Study: A Real Autotune Test on statsig.com

Statsig’s website (www.statsig.com) showcases Statsig’s products and offerings. But because each customer has unique needs, we encourage people to reach out and ask for a live demo. This is important enough to become the website’s primary call-to-action. Internally, we’ve debated the specific wording of the button, but as a hyper-focused startup in build-mode, optimizing our website hasn’t been our highest priority. This is a great situation for using Autotune!

Autotune Setup

The Statsig Console showing Autotune Setup

To set up the test, we used the Statsig Console to create an Autotune experiment, provided the 4 variations we wanted to test, and specified the success event (button click). There are a few parameters to play with, but for most use cases you can use the defaults, as we did:

  • exploration window (default = 24 hrs) — The initial time that Autotune will evenly split traffic. Afterwards Autotune will freely use a probabilistic algorithm to bias traffic towards the winner.
  • attribution window (default = 1 hr) — The maximum time window between an attempt (eg. button impression) and a success event (eg. click) that will count towards Autotune. Adjusting this window can properly capture direct effects or eliminate background noise.
  • winner threshold (default = 95%) — The confidence level Autotune will use to declare a winner and begin diverting 100% of traffic towards it.
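The three knobs above can be captured in a plain configuration object. This is an illustrative sketch only; the field names are hypothetical, not Statsig’s actual configuration schema.

```javascript
// Hypothetical Autotune settings object. Field names are illustrative,
// not Statsig's actual configuration schema.
const autotuneDefaults = {
  explorationWindowHours: 24, // evenly split traffic for this long
  attributionWindowHours: 1,  // max gap between impression and success
  winnerThreshold: 0.95,      // confidence needed to declare a winner
};
```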

Adding Autotune to our website relies on two key lines of code:

  1. statsig.getConfig('demo_button_text_actual_test'): Fetches the assigned text value for each user. Statsig and Autotune handle user assignment. This call also triggers an exposure, which lets Statsig know the user is about to see the button.
  2. statsig.logEvent('click'): Logs a successful click. This, combined with getConfig(), allows Autotune to compute the click-through rate.
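Put together, the integration is only a few lines. The sketch below stubs out the statsig client so it runs standalone; in a real app you would initialize and call the actual Statsig JS SDK, and the hardcoded return value here stands in for whatever variation Autotune assigns.

```javascript
// Stub of the statsig client so this sketch is self-contained.
// A real app would initialize and use the actual Statsig JS SDK.
const loggedEvents = [];
const statsig = {
  // Returns the variation assigned to this user; in the real SDK this
  // call also logs an exposure event.
  getConfig: (configName) => ({
    get: (param, fallback) => 'Get a Live Demo', // stubbed assignment
  }),
  logEvent: (eventName) => loggedEvents.push(eventName),
};

// 1. Fetch the assigned button text (also triggers an exposure).
const config = statsig.getConfig('demo_button_text_actual_test');
const buttonText = config.get('text', 'Request a Demo');

// 2. When the user clicks the button, log the success event.
function onButtonClick() {
  statsig.logEvent('click');
}
onButtonClick();
```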

A quick word about our SDK: Statsig’s SDKs are designed for 100% availability and zero latency. If Statsig’s services go down, your app will not. We wrote about how we accomplish this in our blog post “Designing for Failure”.

Inside Autotune

With each statsig.getConfig() request, Autotune needs to decide which variation to deliver. While there is randomization at work, we minimize scrambling so that users receive a consistent experience upon reload or a return visit. In general, you can expect that variations are consistent within the hour, and generally robust across several hours.
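One simple way to get this kind of stability is to seed the randomization with the user ID plus an hour bucket, so repeated calls within the hour resolve to the same variation. This is a sketch of the general idea, not Statsig’s actual implementation; the function names are illustrative.

```javascript
// Deterministic per-(user, hour) assignment sketch. Illustrates
// hour-stable randomization; not Statsig's production code.

// FNV-1a string hash -> 32-bit unsigned integer.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Pick a variation index using per-variation weights that sum to 1.
// Within the same hour bucket, the same user always gets the same draw.
function assignVariation(userID, weights, now = Date.now()) {
  const hourBucket = Math.floor(now / 3600000); // constant all hour
  const u = fnv1a(`${userID}:${hourBucket}`) / 2 ** 32; // in [0, 1)
  let cumulative = 0;
  for (let i = 0; i < weights.length; i++) {
    cumulative += weights[i];
    if (u < cumulative) return i;
  }
  return weights.length - 1;
}
```

Because the hash input only changes when the hour bucket rolls over, assignments stay fixed within the hour; across hours they drift only as much as the traffic weights themselves drift.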

We have implemented a Bayesian Thompson Sampling algorithm. We did consider another popular choice, UCB-1, but most online comparisons slightly favor Thompson sampling (Ref. 5, 6) and its behavior is nicely differentiated from our other major testing tool, A/B testing.
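As a sketch of how Beta-Bernoulli Thompson sampling picks an arm (an illustration assuming a click/no-click metric, not Statsig’s production code): each arm’s click-through rate gets a Beta(successes + 1, failures + 1) posterior, we draw one sample per arm, and we serve the arm with the largest draw.

```javascript
// Sample Beta(a, b) for positive integers a, b via the order-statistic
// identity: the a-th smallest of (a + b - 1) uniforms is Beta(a, b).
function sampleBeta(a, b) {
  const uniforms = Array.from({ length: a + b - 1 }, () => Math.random());
  uniforms.sort((x, y) => x - y);
  return uniforms[a - 1];
}

// Thompson sampling step: draw once from each arm's posterior and
// return the index of the arm with the largest draw.
function chooseArm(arms) {
  let best = 0;
  let bestDraw = -Infinity;
  arms.forEach((arm, i) => {
    const draw = sampleBeta(arm.successes + 1, arm.failures + 1);
    if (draw > bestDraw) {
      bestDraw = draw;
      best = i;
    }
  });
  return best;
}
```

This is where the exploration/exploitation balance comes from: arms with little data have wide posteriors and still win some draws, while arms with strong evidence of a high rate win most of them.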

We chose to implement a learning phase. One common assumption of MABs is that samples are identically distributed over time. However, we found that even simple click-through rates can vary throughout a day (and throughout a week). Enforcing a learning phase that evenly splits traffic for at least a day builds a robust starting point before allocation is adjusted.

We’ve also implemented an attribution window that catches delayed success events, which may come several steps (or hours) after the impression event (eg. a return visit). This allows Autotune to support many of the specialized scenarios requested by our customers.

Results

The Autotune experiment completed in 55 days and identified a winner after 109k impressions, even at an exceptionally low conversion rate (444 clicks). As a whole, 58% of impressions received the winning variant, much higher than the 25% we would get in an A/B/C/D test. Autotune maximized exposure to the best button during the test.

We provide several charts, including a timeseries showing the probability of each variant being the best. It wasn’t a straight line to victory for “Get a Live Demo”.

Conclusion

Autotune selected “Get a Live Demo” for our website (0.46% success rate), which was 53% better than our existing choice and 28% better than the second-best option. The test required 55 days, but involved no decision-making overhead while diverting 58% of traffic to the best option.

If this had been an A/B/n test, we would have been able to conclude that the winning variation was better than control (p = 0.01, a statistically significant result even with a Bonferroni correction), and that the winning variation was statistically the best (p < 0.03). While the outcome would be the same, MAB delivered two advantages:

  • Under an A/B/C/D test, 75% of the traffic would have been diverted to inferior variations (vs 42% for Autotune).
  • We didn’t have an initial estimate of the click-through rate increase, making it impossible to run a power analysis and estimate how long the test would have taken. Instead of continually peeking at the results, Autotune automated the decision-making process.
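The two bullets above are simple arithmetic, sketched here for concreteness. The Bonferroni threshold assumes alpha = 0.05 with three treatment-vs-control comparisons; the helper name is illustrative.

```javascript
// Share of traffic sent to inferior variations in each design.
const abTestInferiorShare = 1 - 1 / 4;  // even 4-way split: 0.75
const autotuneInferiorShare = 1 - 0.58; // 58% saw the winner: ~0.42

// Bonferroni correction: with m comparisons, require p < alpha / m.
function bonferroniSignificant(p, m, alpha = 0.05) {
  return p < alpha / m;
}
```

With three comparisons the corrected threshold is 0.05 / 3 ≈ 0.0167, so the observed p = 0.01 clears it.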

About Statsig

Statsig offers easy-to-use and analytically “smart” product development tools. Want to try Autotune? Sign up and try it at Statsig.com. Statsig offers free Developer accounts that come with a generous 5M events per month.

References

  1. Statsig Docs — Autotune (https://docs.statsig.com/autotune)
  2. Wikipedia — Multi-armed Bandit (https://en.wikipedia.org/wiki/Multi-armed_bandit)
  3. Ashok Chandrashekar, Fernando Amat, Justin Basilico and Tony Jebara, The Netflix Technology Blog. “Artwork Personalization at Netflix” (https://netflixtechblog.com/artwork-personalization-c589f074ad76)
  4. Alex Birkett, “When to Run Bandit Tests Instead of A/B/n Tests” (https://cxl.com/blog/bandit-tests/)
  5. Brian Amadio, “Multi-Armed Bandits and the Stitch Fix Experimentation Platform” (https://multithreaded.stitchfix.com/blog/2020/08/05/bandits/)
  6. Steve Robert, “Multi-Armed Bandits: Part 6 — A Comparison of Bandit Algorithms” (https://towardsdatascience.com/a-comparison-of-bandit-algorithms-24b4adfcabb)

