You know that meme: “If your infra dev got hit by a bus, would your company survive?” Four years ago, our answer was a solid no. Fast forward to now, and we’ve successfully scaled out to stream over 2 trillion events a day. (Thanks for staying alive all this time, Jason.)
Back in the early days as a brand new startup, we needed to get POCs out the door, so our infrastructure was shipped equally fast to just get it done. It wasn't long until we found product-market fit and started working with companies like OpenAI, Atlassian, and Figma. Huge companies were depending on us, and our infrastructure got tested constantly.
There was a rhythm to it, almost: we’d fix an issue, a new customer would onboard, and we’d be back to firefighting all over again. With SEVs happening every week, we were constantly caught in a dance between achieving stability and moving quickly.
It was clear that we needed to take a step back and build the right set of tools for our stack to mature. Infrastructure as code (IaC) frameworks like Pulumi were the first step for us. As a fast-growing company, we knew that our infra needs would constantly change and evolve over time, so we set out to build something dynamic that would grow and change with us.
In this post, we’ll share how we solved these issues by building a highly customized framework using Pulumi, Docker, and Argo CD to create a truly self-service, safe, and extendable infra stack. We’ll outline the brutal scaling problems we faced, walk through the high-level solution, deep-dive into our system design, and showcase the developer experience.
Imagine yourself as an engineer in the early days of Statsig. To launch a new service or region, you were already bottlenecked because you had to message an engineer (Jason) to set up a service for you, since he was the only one at the time who had all the context of our internal systems.
Jason would set up the Kubernetes configuration YAMLs, manually create an Argo app to point to those files, and then go to our cloud portal and configure any cloud resources that the new service needed. Now, multiply this process by each deployment region and then by each development tier (i.e., latest, staging, prod), and you can see why this was a problem.
We knew this was unsustainable, but it was the scrappy approach we had to take to get things off the ground as a startup. As we described in the intro, though, this was not the greatest developer experience - both for us internally, and our external customers depending on us.
Our first major problem was on the deployment configuration (K8s) side. It was highly manual and hard to debug; testing was tricky, and debug logs/info were not easy to decipher. Changes were also risk-prone, since live services could be affected almost immediately.
We also had no typing or value validation, so it was entirely possible to just “fat finger” typos. This high-maintenance, low-flexibility setup required manual editing, lots of copy-pasting, and complex syntax for customization. For a developer looking at this code, making a change felt like deciding whether to cut the black wire or the red wire on a time bomb:
There was even one time when someone accidentally set production services to connect to our latest (dev-stage) Redis instance instead of the correct prod one. The root cause turned out to be a hand-edited YAML file with a faulty copy-paste and no validation. You can imagine the pain of manually Ctrl+F’ing our config files just to make sure everything matched.
Our second problem was that services and cloud dependencies were disjoint and inconsistent. They were manually spun up via cloud portal UIs, which led to inconsistent naming and resource settings across teams. This also meant that we had no guardrails in place. Anyone with credentials could literally just delete a live cluster (which may or may not have happened once… 👉👈), and we had no way to preview what was being configured or set up.
Lastly, since everything was so disjoint, there was a lack of proper context when setting things up. The relationship between a service and resources was not concrete, so it was possible to miss a critical dependency.
We knew these risks were real, and we knew that we needed to do better. The reality, however, was that as a startup, our team of 5 devs was juggling the entire company’s backend API development, cloud infrastructure, DevOps, and deployments for all of our partner teams. We just didn’t have the resources to scale out the team at the time, so we reached a decision point. It was time to build a tool that would help us move faster and safer.
We started asking ourselves: Could we build something that would allow anyone to just spin up services? Could we work less, achieve more, and do it safely in a fraction of the time?
The answer was yes, but it took a lot of clever design decisions and tooling.
At a high level, our solution uses Pulumi to “code-ify” every piece of infrastructure and make the rest of the system possible. Then, we wired all of our “pulumified” pieces into our CI pipeline, which automatically reads versioned Docker images, runs pulumi up, and generates the exact manifests needed for each environment. This makes infra fully self-service: our engineers never have to hand-edit YAML or click through consoles to spin up new stacks again.
Once Pulumi emits those GitOps manifests, Argo CD takes over, continuously watching for changes and streaming versioned updates into every live cluster. In practice, that means any change in our main ops repo is automatically applied - no manual rollouts, no drift, and instant rollbacks by reverting a Git commit.
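To make “code-ify” concrete, here’s a minimal sketch of what a single piece of infrastructure looks like as Pulumi TypeScript. The resource, config keys, and names below are hypothetical rather than our actual code, but the idea is the same: every environment pulls its values from its own stack config instead of hand-edited YAML or console clicks.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// Each stack ("latest", "staging", "prod-us-east", ...) carries its own config values.
const config = new pulumi.Config();
const stack = pulumi.getStack();

// Hypothetical example: a Redis cache sized per environment.
const cache = new gcp.redis.Instance(`service-x-cache-${stack}`, {
  memorySizeGb: config.getNumber("redisMemoryGb") ?? 1, // small for dev, larger for prod
  tier: config.get("redisTier") ?? "BASIC",              // e.g. STANDARD_HA in prod stacks
  region: config.get("region") ?? "us-east1",
});

// Downstream code automatically references the right instance for its environment,
// which is exactly the class of mistake the hand-edited YAML setup let through.
export const redisHost = cache.host;
```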
You can think of this process in three main phases. We’ll walk through each in detail, but in a nutshell:
Build phase. A developer pushes application code for Service X in the Statsig repo. CI builds and pushes a Docker image to DockerHub, then writes that image tag into a remote Version Datastore.
Cloud provisioning phase. CI triggers pulumi up in our OPS Repo, and Pulumi provisions or updates infrastructure.
Service deployment phase. Pulumi auto-generates our service configurations (YAML files) and Argo CD rolls out those manifests.
By wiring it all together, we get end-to-end automation from code commit to live deployment, true versioning, easy rollbacks, and a whole lot of other cool stuff. In the next section, we’ll dive deeper into how each of those three phases works.
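Before diving in, here’s a hedged sketch of the smallest handoff in that chain: the build phase recording which image tag it just pushed into a Version Datastore. The bucket, object names, and CLI usage are made up for illustration, and a GCS bucket is just one possible backing store.

```typescript
import { Storage } from "@google-cloud/storage";

// Hypothetical helper a CI job could run after `docker push` to record the new tag.
const storage = new Storage();

async function recordImageTag(service: string, tag: string): Promise<void> {
  const file = storage.bucket("example-version-datastore").file(`versions/${service}.json`);
  await file.save(
    JSON.stringify({ tag, recordedAt: new Date().toISOString() }),
    { contentType: "application/json" },
  );
}

// e.g. `ts-node record-image-tag.ts service-x 2024.06.01-abc123`
const [, , service, tag] = process.argv;
if (!service || !tag) {
  console.error("usage: record-image-tag <service> <tag>");
  process.exit(1);
}
recordImageTag(service, tag).catch((err) => {
  console.error("failed to record image tag", err);
  process.exit(1);
});
```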
1. First, a developer pushes changes to a repo (call it Service X).
2. The CI pipeline then starts a job that does two things. First, it builds a Docker image, pushes it to DockerHub, and tags it with a version. Second, it writes that image-identifying tag into a Version Datastore (e.g., a repo, GCS bucket, or Statsig).
3. Downstream CI/CD then triggers deployment jobs in the OPS repo (where our Pulumi code lives). This is what lets us preview changes, set alerts for failures, create deletion-protection guardrails, and unlock a bunch of other observability tools.
4. Pulumi runs multiple stacks in parallel (e.g., Latest, Staging, Prod-region-A). Each stack aligns the live state of cloud resources (Kubernetes clusters, networks, datastores) with the state specified by the Pulumi code in our OPS repo. The stacks pull the image tag from the Version Datastore and auto-generate Kubernetes manifests into the Autogen Ops Repo.
5. Each stack concludes by emitting Argo CD Application Kubernetes YAML specs, powered by the flexibility of the CustomResource type in Pulumi’s Kubernetes provider. The YAML spec file includes the repository URL, path to manifests, image tag (ensuring deploys use the exact version), target cluster, namespace, and sync policy (see the sketch after this list).
6. Argo CD polls and applies updates using a canary strategy. From this point on, Argo CD continuously monitors our Ops Repo for changes and rolls them out, applying the service Kubernetes configurations.
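Here’s the sketch promised in step 5: a stripped-down Pulumi program that renders Kubernetes manifests for one service and emits the Argo CD Application pointing at them. The repo URL, paths, and namespaces are illustrative, and rendering via renderYamlToDirectory is one way to produce the generated files rather than a claim about our exact mechanism.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

const config = new pulumi.Config();
const stack = pulumi.getStack();              // e.g. "staging", "prod-region-a"
// In the real pipeline this comes from the Version Datastore; here it's plain stack config.
const imageTag = config.require("imageTag");

// Render plain YAML into a directory instead of applying it directly,
// so the generated manifests can be committed for Argo CD to sync.
const renderer = new k8s.Provider("yaml-renderer", {
  renderYamlToDirectory: `pulumi-gen/${stack}/service-x`,
});

// Step 4: the auto-generated Kubernetes manifest, pinned to the exact image build.
new k8s.apps.v1.Deployment("service-x", {
  metadata: { name: "service-x", namespace: stack },
  spec: {
    replicas: 3,
    selector: { matchLabels: { app: "service-x" } },
    template: {
      metadata: { labels: { app: "service-x" } },
      spec: {
        containers: [{ name: "service-x", image: `example/service-x:${imageTag}` }],
      },
    },
  },
}, { provider: renderer });

// Step 5: the Argo CD Application spec telling Argo CD where those manifests live
// and which cluster/namespace to sync them into.
new k8s.apiextensions.CustomResource("service-x-app", {
  apiVersion: "argoproj.io/v1alpha1",
  kind: "Application",
  metadata: { name: `service-x-${stack}`, namespace: "argocd" },
  spec: {
    project: "default",
    source: {
      repoURL: "https://github.com/example/autogen-ops-repo",
      targetRevision: "main",
      path: `pulumi-gen/${stack}/service-x`,
    },
    destination: { server: "https://kubernetes.default.svc", namespace: stack },
    syncPolicy: { automated: { prune: true, selfHeal: true } },
  },
});
```

Since everything, including the image tag, ends up in Git, a rollback is just a revert that Argo CD picks up on its next sync.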
In addition to making infra automated and accessible to our partner teams, another big goal of ours was to make the developer experience as smooth, seamless, and approachable as possible. Here's an example PR with tools to empower the user:
On the left, developers make changes to the TypeScript code, which is the part that automatically generates the K8s manifest files in a pulumi-gen directory. This saves people from having to edit service deployment configs by hand! It’s a simple example, but for larger-scale changes, a PR can include thousands of auto-generated files.
Another nice touch is that in the Discussion Area, your PR automatically gets marked with previews for cloud resources, so you can see what CI/CD will do if you merge your PR in. This gives developers visibility into their changes without having to check them in first. From this screenshot, for example, I know that the diff I’m checking in will leave prod resources completely untouched:
Lastly, on the right, we have automated actions running on each PR to check builds. We also have an action that checks for deletions and, if it finds any, marks your PR with a warning to give engineers a heads-up about unintended behavior. Cloud resources leverage Pulumi’s protect flag, so even if the code gets checked in, deletion will not happen unless manual action is taken, as shown in the description:
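For reference, opting a resource into that protection is a one-line resource option in Pulumi. The resource below is hypothetical; the point is that a protected resource makes the run fail instead of deleting anything until someone deliberately unprotects it.

```typescript
import * as gcp from "@pulumi/gcp";

// Hypothetical production datastore marked as protected. If a change (or a bad
// copy-paste) would delete it, `pulumi up` errors out instead of destroying it.
// Removing the protection requires an explicit, manual `pulumi state unprotect`.
const prodCache = new gcp.redis.Instance("prod-cache", {
  memorySizeGb: 16,
  tier: "STANDARD_HA",
  region: "us-east1",
}, { protect: true });

export const prodCacheHost = prodCache.host;
```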
We also have guardrails set up for the worst-case scenarios. If undesirable code does get checked in and a run errors out, or if a protected resource comes close to being deleted (the protections above make the run throw an error instead), we have metrics and Topline Alerts powered by Statsig. Whenever a stack run fails, we keep track of it as a metric:
These are connected to an alerting service that notifies our on-call Slack channel, allowing us to address the issue immediately:
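The reporting side can be as simple as the deployment job logging an event when a stack run fails. Here’s a hedged sketch using the Statsig Node SDK; the event name, user ID, and environment variable are placeholders rather than our actual setup.

```typescript
import Statsig from "statsig-node";

// Hypothetical failure hook a deployment job could call when `pulumi up` fails.
async function reportStackFailure(stackName: string, error: Error): Promise<void> {
  await Statsig.initialize(process.env.STATSIG_SERVER_SECRET ?? "");

  // A synthetic "user" representing the CI pipeline; the event feeds the
  // metrics and Topline Alerts that page the on-call channel.
  Statsig.logEvent(
    { userID: "infra-ci" },
    "pulumi_stack_run_failed",
    stackName,
    { error: error.message },
  );

  // Flush queued events and shut down before the CI job exits.
  Statsig.shutdown();
}
```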
Pulumi was the first step in our platform engineering journey. It’s sort of like that “main” skill that you level up in a video game to unlock the rest of the skill tree in our infra stack. It’s only after we “pulumified” our infra that we’re now able to programmatically configure our deployments and ship all of the above, plus other cool features like:
Automated regional rollouts, powered by Statsig Release Pipelines
Service traffic sharding
Cost-based VM selection automation
We’ve only scratched the surface of what our new setup can do. Other projects coming up next include multi-cloud, policy-as-code, Statsig Experimentation for infra, per-company cluster deployments at scale, and much, much more.
We’re really excited for all the other convenient tools and systems that this framework has unlocked for us, and we hope this can inspire other dev teams hitting these scaling problems! If there are any parts you found particularly interesting that you want more details on, reach out to us in our community Slack and we’ll follow up with another post on it on the Statsig Engineering Blog.