Designing for failure

Vineeth Madhusudanan
Sat Dec 18 2021
RELIABILITY-ENGINEERING EXPERIMENTATION RELIABILITY

How Statsig stays up

Statsig serves billions of individual user interactions. We designed the service so that the apps you build on Statsig stay reliable and available: if your application cannot reach Statsig for any reason, it will continue to work exactly as you expect, using locally cached values. Read on for how we make this possible.

Server SDKs

How do the Server SDKs return results instantly?
When you initialize a Statsig SDK on your server, the SDK reaches out to Statsig and retrieves definitions for your feature flags, experiments and dynamic configs. Every subsequent feature gate, experiment or dynamic config check is processed locally on your server, so the response time for each check is a fraction of a millisecond. Events uploaded to Statsig from the SDK are batched and will survive transient network connectivity issues.
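To make local evaluation concrete, here is a minimal sketch of a gate check that runs entirely in process. The rule set shape, the `bucket` and `check_gate` helpers, and the salt values are all hypothetical and chosen for illustration; they are not Statsig's rule format or SDK API.

```python
import hashlib

# Hypothetical rule set, shaped for illustration only (not Statsig's wire format).
RULES = {
    "new_checkout": {"salt": "a1b2", "pass_percentage": 50},
    "dark_mode": {"salt": "c3d4", "pass_percentage": 100},
}


def bucket(salt: str, user_id: str, buckets: int = 10_000) -> int:
    """Map (salt, user) to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(f"{salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def check_gate(rules: dict, gate_name: str, user_id: str) -> bool:
    """Evaluate a gate from the in-memory rule set; unknown gates return False."""
    rule = rules.get(gate_name)
    if rule is None:
        return False
    # pass_percentage is 0..100; buckets are 0..9999, so scale by 100.
    return bucket(rule["salt"], user_id) < rule["pass_percentage"] * 100
```

Because the check is a dictionary lookup plus one hash, it needs no network call, and the same user always lands in the same bucket for a given gate.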

What happens if there is no connectivity to Statsig?
If your server loses connectivity to Statsig, it’ll happily continue serving results using the cached rule set it has. When connectivity to Statsig is available, it’ll resume checking for updates to your project’s rule set.
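The behavior above amounts to "keep the last good rule set until a newer one arrives". A minimal sketch, assuming a hypothetical `fetch` callable that stands in for the network request and may raise on failure (the `RuleStore` class is illustrative, not part of any Statsig SDK):

```python
from typing import Callable, Optional


class RuleStore:
    """Holds the last rule set that was fetched successfully."""

    def __init__(self, fetch: Callable[[], dict], initial: Optional[dict] = None):
        self._fetch = fetch  # hypothetical network call; may raise
        self._rules = initial or {}
        self.last_error: Optional[Exception] = None

    def refresh(self) -> bool:
        """Try one update. On failure, keep serving the cached rules."""
        try:
            self._rules = self._fetch()
        except Exception as exc:  # network errors, timeouts, bad payloads
            self.last_error = exc
            return False
        self.last_error = None
        return True

    @property
    def rules(self) -> dict:
        return self._rules
```

Calling `refresh()` on a timer gives the polling behavior: a failed poll changes nothing the application can observe, and the next successful poll picks up any pending updates.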

What if I need to bootstrap a server, without connectivity to Statsig?
The Statsig SDKs allow you to save the rule sets that have been downloaded to your server and use them to bootstrap servers that come up without Internet connectivity or connectivity to Statsig. When connectivity resumes, the SDKs will refresh the rule set with any changes made since it was saved. See the documentation: bootstrapValues covers how to supply a saved rule set, and rulesUpdatedCallback covers how to be notified of updates to it.
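One way to wire this up is to persist the rule set whenever it changes and read it back at startup. The sketch below is an assumption about how you might do that on your side; `save_rules` and `load_bootstrap` are hypothetical helpers, not SDK functions. In practice you would call something like `save_rules` from the update callback and pass the result of `load_bootstrap` as the bootstrap value, following the SDK documentation for the exact option names and types.

```python
import json
import os
import tempfile
from typing import Optional


def save_rules(path: str, rules_json: str) -> None:
    """Persist the latest rule set atomically (write a temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(rules_json)
    os.replace(tmp_path, path)  # readers never see a half-written file


def load_bootstrap(path: str) -> Optional[str]:
    """Return the saved rule set, or None if it is missing or not valid JSON."""
    try:
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        json.loads(text)  # validate only; the raw string is what gets returned
        return text
    except (OSError, ValueError):
        return None
```

Writing to a temporary file and renaming it means a server that crashes mid-write still finds the previous, complete rule set on its next start.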

Watch this 3-minute video for more context!

Client SDKs

How do the Client SDKs return results instantly?
When you initialize a Statsig client SDK, the SDK reaches out to Statsig, retrieves the precomputed values of all feature gates, experiments, and dynamic configs for the current user, and caches those values locally. Every subsequent feature gate, experiment or dynamic config check looks up the value in memory, so the response time for each check is a fraction of a millisecond. Events uploaded to Statsig from the SDK are batched and will survive transient network connectivity issues, either through retries or by saving failed log event requests to local storage.
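The event path can be sketched as a small batching logger that holds on to batches it could not send and retries them on the next flush. Everything here is hypothetical: `EventLogger`, the `send` callable, and the in-memory `failed_batches` list, which stands in for whatever local storage the platform provides.

```python
from typing import Callable, List


class EventLogger:
    """Batches events; keeps failed batches so a later flush can retry them."""

    def __init__(self, send: Callable[[List[dict]], None], batch_size: int = 3):
        self._send = send  # hypothetical upload call; raises on failure
        self._batch_size = batch_size
        self._pending: List[dict] = []
        self.failed_batches: List[List[dict]] = []  # stand-in for local storage

    def log(self, event: dict) -> None:
        self._pending.append(event)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Retry previously failed batches first, then the current one.
        batches = self.failed_batches
        if self._pending:
            batches = batches + [self._pending]
        self.failed_batches, self._pending = [], []
        for batch in batches:
            try:
                self._send(batch)
            except Exception:
                self.failed_batches.append(batch)
```

A real client would also cap how many failed batches it keeps so that a long outage cannot grow storage without bound; that limit is left out here for brevity.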

What happens if there is no connectivity to Statsig?
If your client loses connectivity to Statsig, it will fall back to using cached values. If this is a new user who has not had a chance to cache any values, all SDK APIs will return their default values: false for gates, empty for experiments and configs. Every experiment or dynamic config is also configured in your code with a default value that serves as a fallback.
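The fallback order described above (cached value first, then a default) can be sketched as follows. `ClientCache` and its method names are hypothetical and only illustrate the lookup logic; they are not the client SDK's API.

```python
from typing import Any, Dict, Optional


class ClientCache:
    """In-memory view of precomputed values for one user."""

    def __init__(self, cached: Optional[Dict[str, Any]] = None):
        cached = cached or {}  # a brand-new user has nothing cached
        self._gates: Dict[str, bool] = cached.get("gates", {})
        self._configs: Dict[str, dict] = cached.get("configs", {})

    def check_gate(self, name: str) -> bool:
        # Gates default to False when no value is cached.
        return self._gates.get(name, False)

    def get_config_value(self, config: str, key: str, default: Any) -> Any:
        # The caller supplies the default, so the code always has a fallback.
        return self._configs.get(config, {}).get(key, default)
```

For example, `ClientCache().get_config_value("banner", "color", "blue")` returns `"blue"` for a user with no cached values, so the application renders with its in-code default rather than failing.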

Do we need a relay server?
Some vendors provide an onsite relay or proxy to reduce load on their servers. A decade ago, outbound internet connectivity was a scarce resource at companies that weren’t digital first. Today a relay server offers little value and adds another potential point of failure to deploy, maintain and monitor. If there’s a problem or pain point you’re concerned about, though, we’d love to hear about it!

Server infrastructure
Statsig’s infrastructure spans AWS and Azure across multiple availability zones. Most data is returned from in-memory caches, allowing typical server response times well under 50ms. Because server and client SDKs cache values and evaluate locally, your application can continue to function without having to connect to the Statsig servers, except to initialize and then to lazily log events.

To deal with increased demand, we autoscale across our cloud providers. When an availability zone fails for any reason, we seamlessly fail over to surviving availability zones.

Every time we deploy code, we fail out an availability zone to upgrade it. Because failover is a core part of our deployment strategy, it is exercised on every deploy, which keeps it robust. Failovers that aren’t exercised frequently can become fragile; our approach ensures this isn’t the case.
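As a rough illustration of the deploy pattern, the sketch below upgrades one zone at a time and shifts traffic to the remaining zones first. It is a simplified model under stated assumptions, not a description of our actual tooling: `rolling_upgrade`, the `on_drain` hook, and the zone names are all hypothetical.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    zones: Dict[str, str],
    new_version: str,
    on_drain: Callable[[List[str]], None],
) -> None:
    """Upgrade one zone at a time, failing traffic over to the others first."""
    for zone in list(zones):
        serving = [z for z in zones if z != zone]
        if not serving:
            raise RuntimeError("need at least two zones to fail over")
        on_drain(serving)  # traffic now goes only to the other zones
        zones[zone] = new_version  # upgrade the drained zone


# Usage: record which zones carry traffic at each step of the deploy.
zones = {"zone-a": "v1", "zone-b": "v1", "zone-c": "v1"}
steps: List[List[str]] = []
rolling_upgrade(zones, "v2", steps.append)
```

Each deploy runs the same drain-and-serve path that a real zone outage would trigger, which is why the failover stays exercised.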

Statsig is built by builders for builders. Have a question about reliability? Reach out and ask — we’re happy to engage!

Some links to learn more:
Statsig’s availability dashboard
3m video on our client and server SDKs
Statsig’s security posture



