A/B Testing performance wins on NestJS API servers

Tue Jul 09 2024

Stephen Royal

Software Engineer, Statsig

It’s time for another exploration of how we use Statsig to build Statsig. In this post, we’ll dive into how we run experiments on our NestJS API servers to reduce request processing time and CPU usage.

This type of experiment was ubiquitous at Facebook - during periods of high utilization, many engineers would be looking for potential performance improvements or features that could be disabled to reduce the load on the limited infrastructure. Facebook instrumented its backend PHP web servers with metrics for CPU usage and request processing time, which made it easy for engineers across the company to measure the impact of a potential performance improvement. We’ve done the same for our NestJS app, which has simplified the process of testing and rolling out changes that improve API latency for customers across the board.

The change

The first implementations of our SDKs exposed asynchronous APIs to evaluate gates, dynamic configs, experiments, and layers. Over time, we removed this limitation. The same limitation existed in our backend, which evaluates an entire project for a given user to serve the /initialize endpoint for client SDKs.

When we removed the async nature of that evaluation, we didn’t revisit the code to clean up steps that could be eliminated entirely. When I noticed some of this unnecessary work, I knew there was a potential to improve performance on our backend, but I wasn’t sure how much of an impact it would have. So I ran an experiment to measure it!

The setup

Adding a feature gate is a quick and easy way to measure the impact of a change, and it is often a change you would have wanted to toggle independently of a code release anyway. Our backend is already instrumented with a Statsig SDK, so it was trivial to add another gate check. This made it easy to verify the new behavior was correct, measure the impact of the change, and have the ability to turn it off if necessary.
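To make the gate check concrete, here is a minimal sketch of routing a request through either the existing or the streamlined evaluation path. Everything named here is an assumption for illustration: GateClient is a hypothetical stand-in for an SDK client rather than the real Statsig API, and the gate name and evaluator functions are made up.

```typescript
// Hypothetical minimal interface standing in for a feature-gate SDK client.
interface GateClient {
  checkGate(user: { userID: string }, gateName: string): boolean;
}

// Hypothetical evaluation paths: the existing one and the streamlined one.
type Evaluator = (userID: string) => string;

// Route a request through the new path only when the gate passes, so the
// change can be measured as an experiment and switched off without a deploy.
function evaluateWithGate(
  client: GateClient,
  userID: string,
  legacy: Evaluator,
  streamlined: Evaluator,
  gateName = "streamlined_initialize_eval", // hypothetical gate name
): string {
  return client.checkGate({ userID }, gateName)
    ? streamlined(userID)
    : legacy(userID);
}
```

Keeping both paths behind one function means the old code can stay in place until the experiment concludes, and removing the gate later is a small, local change.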

In addition, we already have performance metrics logged via the Statsig SDK.

We read CPU metrics from /sys/fs/cgroup/cpuacct.stat, and memory metrics from /sys/fs/cgroup/memory/memory.stat and /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes. These are aggregated and logged to Statsig, and they define our average CPU and memory metrics.
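As a rough illustration of how a cpuacct.stat file can be turned into a CPU usage number, the sketch below parses the file contents and compares two samples. It is not our production code: the function names are hypothetical, and it assumes the cgroup v1 format of "user" and "system" counters expressed in clock ticks, with 100 ticks per second (a common USER_HZ value on Linux).

```typescript
// Cumulative CPU time in clock ticks, as reported by a cgroup v1 cpuacct.stat file.
interface CpuAcctStat {
  user: number;
  system: number;
}

// Parse the "key value" lines of cpuacct.stat, e.g. "user 1200\nsystem 300\n".
function parseCpuAcctStat(contents: string): CpuAcctStat {
  const stat: CpuAcctStat = { user: 0, system: 0 };
  for (const line of contents.split("\n")) {
    const [key, value] = line.trim().split(/\s+/);
    if (key === "user" || key === "system") {
      stat[key] = Number(value);
    }
  }
  return stat;
}

// Average number of cores in use between two samples.
// ticksPerSecond is USER_HZ, commonly 100 on Linux; treat that as an assumption.
function cpuCoresUsed(
  previous: CpuAcctStat,
  current: CpuAcctStat,
  elapsedSeconds: number,
  ticksPerSecond = 100,
): number {
  const deltaTicks =
    current.user + current.system - (previous.user + previous.system);
  return deltaTicks / ticksPerSecond / elapsedSeconds;
}
```

Because the counters are cumulative, a single read says little on its own; sampling on an interval and taking the difference is what yields a usage figure that can be averaged per pod.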

We also define an api_latency metric at the pod level, which reads the api_request event for successful status codes and averages the latency per pod. We log the api_request event via a NestJS interceptor on every request.
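The per-pod averaging described above can be sketched in a few lines. This is only an illustration of the aggregation, not how Statsig computes the metric internally: the ApiRequestEvent shape and its field names are hypothetical, and "successful" is assumed to mean a 2xx status code.

```typescript
// Hypothetical shape of one logged api_request event.
interface ApiRequestEvent {
  podId: string;
  statusCode: number;
  latencyMs: number;
}

// Average latency per pod, counting only successful (assumed 2xx) responses.
function averageLatencyByPod(events: ApiRequestEvent[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const event of events) {
    if (event.statusCode < 200 || event.statusCode >= 300) continue;
    const entry = totals.get(event.podId) ?? { sum: 0, count: 0 };
    entry.sum += event.latencyMs;
    entry.count += 1;
    totals.set(event.podId, entry);
  }
  const averages = new Map<string, number>();
  totals.forEach(({ sum, count }, podId) => {
    averages.set(podId, sum / count);
  });
  return averages;
}
```

Excluding failed requests keeps fast error responses from making latency look better than it is for the requests that actually did the work.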

Determining the impact: the results

At first glance, the results seem a bit underwhelming. There wasn’t any apparent impact on API latency, though there was a slight improvement in CPU usage.

Cumulative exposures 1

However, these CPU and request latency metrics are fleet-wide - meaning metrics from services which didn’t even serve the endpoint that was changing are included in the top-level experiment results. Since the change we made only impacted the /v1/initialize endpoint, which our client SDKs use, we needed to filter the results down to see the true impact.

So, we opened up the “explore” tab in the experiment results section to write a custom query that would filter the results down to the relevant servers.
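To see why the filter matters, consider a small sketch of the same idea outside of Statsig. This is not the query we wrote in the explore tab; the PodMetric shape, its field names, and the helper functions are all hypothetical. It simply compares the test and control group means, with and without restricting to pods that served /v1/initialize.

```typescript
// Hypothetical per-pod summary row; field names are illustrative only.
interface PodMetric {
  group: "control" | "test";
  servesInitialize: boolean; // whether the pod handled /v1/initialize traffic
  value: number; // e.g. average API latency or CPU usage for the pod
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Relative change of the test group versus control, optionally restricted to
// pods that serve the endpoint the change actually touched.
function relativeDelta(rows: PodMetric[], onlyInitializePods: boolean): number {
  const relevant = onlyInitializePods
    ? rows.filter((row) => row.servesInitialize)
    : rows;
  const controlMean = mean(
    relevant.filter((row) => row.group === "control").map((row) => row.value),
  );
  const testMean = mean(
    relevant.filter((row) => row.group === "test").map((row) => row.value),
  );
  return (testMean - controlMean) / controlMean;
}
```

Pods that never run the changed code contribute identical behavior to both groups, so including them pulls the measured delta toward zero and widens the noise around it.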

Cumulative Exposures 2

As you can see here, once we filtered down to only the pods serving /v1/initialize traffic, this was a huge win! A 4.90% ±1.0% decrease in average API latency on those pods, and a 1.90% ±0.70% decrease in CPU usage!

API Latency 1

These types of experiments can have a dramatic impact on the performance of our customers’ integrations, and the end users’ experience in apps that use Statsig. They also impact our costs and ability to scale as usage grows.

Fortunately, I was able to “stand on the shoulders of giants” - someone had already hooked up the Statsig Node SDK, logged events for CPU usage and request latency, and created metrics for these in Statsig. Doing this sort of work up front empowers everyone in your team, organization, or company to build, measure, and ship incremental wins much, much faster.

Happy experimenting!
