It’s time for another exploration of how we use Statsig to build Statsig. In this post, we’ll dive into how we run experiments on our NestJS API servers to reduce request processing time and CPU usage.
This type of experiment was ubiquitous at Facebook - during periods of high utilization, many engineers would look for potential performance improvements or features that could be disabled to reduce load on the limited infrastructure. Facebook instrumented its backend PHP web servers with metrics for CPU usage and request processing time, which made it easy for engineers across the company to measure the impact of a potential performance improvement. We’ve done the same for our NestJS app, which has simplified the process of testing and rolling out changes that improve API latency for customers across the board.
The first implementations of our SDKs exposed asynchronous APIs to evaluate gates, dynamic configs, experiments, and layers. Over time, we removed this limitation. The same asynchronous pattern existed in our backend code that evaluates an entire project for a given user to serve the /initialize endpoint for client SDKs.
When we removed the async nature of that evaluation, we didn’t revisit the code to clean up steps that could be eliminated entirely. When I noticed some of this unnecessary work, I knew there was potential to improve performance on our backend, but I wasn’t sure how much of an impact it would have. So I ran an experiment to measure it!
Adding a feature gate is a quick, easy way to measure the impact of a change that you likely needed to be able to toggle independently of a code release anyway. Our backend is already instrumented with a Statsig SDK, so adding another gate check was trivial. That made it easy to verify the new behavior was correct, measure the impact of the change, and turn it off if necessary.
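As a rough sketch, a guarded change like this might look as follows with the statsig-node SDK. The gate name and the two evaluation helpers are hypothetical placeholders, not our actual internals, and exact SDK method signatures can vary by version:

```typescript
import Statsig, { StatsigUser } from 'statsig-node';

// Placeholder types and helpers for illustration only.
type ProjectEvaluation = Record<string, unknown>;
declare function evaluateProjectSync(user: StatsigUser): ProjectEvaluation;
declare function evaluateProjectAsync(user: StatsigUser): Promise<ProjectEvaluation>;

// Assumes Statsig.initialize(<server secret>) has already been called at startup.
async function evaluateProjectForUser(user: StatsigUser): Promise<ProjectEvaluation> {
  // Hypothetical gate guarding the new code path that skips the redundant async steps.
  const useSyncPath = await Statsig.checkGate(user, 'use_sync_initialize_evaluation');

  // If anything looks wrong, turning the gate off reverts to the old behavior
  // without a deploy.
  return useSyncPath ? evaluateProjectSync(user) : evaluateProjectAsync(user);
}
```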
In addition, we already have performance metrics logged via the Statsig SDK.
We read CPU metrics from /sys/fs/cgroup/cpuacct.stat, and memory metrics from /sys/fs/cgroup/memory/memory.stat and /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes. These get aggregated, logged to Statsig, and define our average CPU and memory metrics.
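A minimal sketch of that collection loop is below. The event names, the sampling interval, and the pod-as-user convention are assumptions for illustration, not our exact implementation:

```typescript
import { readFileSync } from 'fs';
import Statsig from 'statsig-node';

// Read cgroup v1 CPU counters; cpuacct.stat contains lines like
// "user 12345" and "system 6789" (in USER_HZ ticks).
function readCgroupCpu(): { user: number; system: number } {
  const stat = readFileSync('/sys/fs/cgroup/cpuacct.stat', 'utf8');
  const values: Record<string, number> = {};
  for (const line of stat.trim().split('\n')) {
    const [key, value] = line.split(' ');
    values[key] = Number(value);
  }
  return { user: values['user'] ?? 0, system: values['system'] ?? 0 };
}

// Read kernel memory usage in bytes from the memory cgroup.
function readKernelMemoryBytes(): number {
  return Number(
    readFileSync('/sys/fs/cgroup/memory/memory.kmem.usage_in_bytes', 'utf8').trim(),
  );
}

// Periodically log the readings; a server-side "user" here is just the pod.
const podUser = { userID: process.env.POD_NAME ?? 'unknown-pod' };
setInterval(() => {
  const cpu = readCgroupCpu();
  Statsig.logEvent(podUser, 'pod_cpu_ticks', cpu.user + cpu.system, {
    user: String(cpu.user),
    system: String(cpu.system),
  });
  Statsig.logEvent(podUser, 'pod_kernel_memory_bytes', readKernelMemoryBytes());
}, 60_000);
```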
We also define an api_latency metric at the pod level, which reads the api_request event for successful status codes and averages the latency per pod. We log the api_request event via a NestJS interceptor on every request.
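A bare-bones version of such an interceptor could look like this. The event name matches the one above, but the metadata fields and the pod-level "user" passed to logEvent are illustrative assumptions:

```typescript
import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import Statsig from 'statsig-node';

@Injectable()
export class ApiRequestLoggingInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    const start = Date.now();
    const request = context.switchToHttp().getRequest();

    return next.handle().pipe(
      tap(() => {
        const response = context.switchToHttp().getResponse();
        const latencyMs = Date.now() - start;

        // Log one api_request event per request, with latency as the value
        // and the path/status attached as metadata.
        Statsig.logEvent(
          { userID: process.env.POD_NAME ?? 'unknown-pod' },
          'api_request',
          latencyMs,
          {
            path: request.url,
            status: String(response.statusCode),
          },
        );
      }),
    );
  }
}
```

An interceptor like this would typically be registered once, globally, e.g. with app.useGlobalInterceptors(new ApiRequestLoggingInterceptor()), so every endpoint is covered without per-controller wiring.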
At first glance, the results seem a bit underwhelming: there isn’t any impact on API latency, though there is a slight improvement in CPU usage.
However, these CPU and request latency metrics are fleet-wide - meaning metrics from services that didn’t even serve the endpoint we changed are included in the top-level experiment results. Since our change only impacted the /v1/initialize endpoint, which our client SDKs use, we needed to filter the results down to see the true impact.
So, we opened up the “explore” tab in the experiment results section to write a custom query that would filter the results down to the relevant servers.
As you can see here, once we filtered down to only the pods serving /v1/initialize traffic, this was a huge win: a 4.90% ±1.0% decrease in average API latency on those pods, and a 1.90% ±0.70% decrease in CPU usage!
These types of experiments can have a dramatic impact on the performance of our customers’ integrations and on the end-user experience in apps that use Statsig. They also affect our costs and our ability to scale as usage grows.
Fortunately, I was able to “stand on the shoulders of giants” - someone had already hooked up the Statsig node SDK, logged events for CPU usage and request latency, and created metrics for them in Statsig. Doing this sort of work up front empowers everyone on your team, in your organization, or at your company to build, measure, and ship incremental wins much faster.
Happy experimenting!