Statsig’s Journey with Druid
This is the text version of the story that we shared at Druid Summit Seattle 2022.
Every feature we build at Statsig serves a common goal — to help you better know about your product, and empower you to make good decisions for your product. Using Statsig, product teams should feel comfortable about understanding their product performance and gaining insights through data. This is the power of Product Observability. You may be able to find numerous definitions of the term on Google. In our word, it is a proactive, continuous and adaptive approach to measure the impact of what you build, and to use data to learn what to build next. Experiments, Metrics, Ultrasound, etc., are built to fulfill the observability need.
There is also a need to work with live data as soon as it lands. That is why, in June, we launched Events Explorer, a realtime observability tool to help with understanding product data with no delay. Following Statsig’s speed of product launches, it took our team less than three months to turn the idea into a publicly-available and production-ready product. To make it happen, we stood on the shoulders of giants —by leveraging Apache Druid as our realtime data engine. In this blog, we will share our journey with Druid in the past few months.
Druid is an open source distributed data store managed by the Apache Foundation. It is mostly used in realtime data application, for example, fraud detection, ads analysis and recommendation, etc., where the applications require high-volume realtime data ingestion and low latency data query capabilities. Our need of realtime events and metrics data analysis is also a great fit.
There are a lot of online resources that discuss about Druid in detail. The Druid website is a good starting point. For context, let’s briefly take look at how Druid works.
At a high level, Druid does three main tasks: Ingestion, Storage and Query. Druid embraces a distributed architecture that can handle scalability and tolerate failure well. Using the diagram above, let’s talk through what each component does:
In Master servers,
Metadata about the system is stored in an external metadata storage to keep the current state of the configuration.
In Query servers,
In Data servers,
Besides being stored in Historicals for better query performance, data is also stored in a deep storage (e.g., AWS S3, Azure Blob Storage) for data durability.
Zookeeper serves to keep the realtime state of Druid that involves leader election, data management, task management, and so on.
Our team spent some time doing research and comparing similar products in the market. Besides its core functionalities, there are three main reasons why we ended up choosing Druid.
Many early users of Druid may be surprised at this reason. It can be a large amount of work to onboard and start using Druid, due to the number of moving parts and configuration options. Unfortunately, the complexity of Druid still exists today. But, on the bright side, the Druid community has done much work to improve the onboarding experience. Our first working cluster for testing and validation was spun up in a few hours by referring to the quick start guide and example configurations.
In addition, the community support is fantastic, on both Slack and the forum. Questions about Druid are often answered by an expert in the community within a short amount of time. The strong community support has saved us a lot of effort as a new user. With the help of the community and Druid’s thorough documentation, our onboarding experience was much smoother than what we originally expected.
On the opposite side of the complexity of a distributed system, flexibility is an advantage of it. Druid is no exception. Because of the way it is designed, each component can be managed respectively depending on its requirement and functionality. It also means that upgrades to the system are no longer intimidating compared to a monolith system. This is super helpful for us during daily operations, as the Infra team can make changes to Druid while not slowing down the product development progress on top of it.
As a product analytics company, we cannot allow data flow to slow down. Therefore, a high-performance data system is crucial to us. Both the ingestion pipeline and the query experience are performant and meet our needs. Once the ingestion is configured, there is no delay to when you see your data in the datastore and ready for query. There are benchmark results that can attest to this as well.
Although we have only been using Druid for several months, there are some learning that we think is valuable to share with the community.
At Statsig, the majority of our workloads are running inside Kubernetes across many regions. To help manage all those workloads, we adopted the GitOps approach in our daily operations. In short, GitOps leverages the DevOps practices (version control, collaboration, CI/CD, etc.) for infrastructure automation.
We use the same approach to manage Druid in our clusters and across environments. It facilitates collaboration by offering better clarity and transparency on what changes are made to the system. Rollbacks of the changes are possible when things go unexpectedly.
On top of GitOps, we also use the community-supported Druid Operator to help us manage the lifecycle and Kubernetes resources in a controlled way. It offers features, such as, rolling deploy that contributes to the uptime of Druid during upgrades, autoscaling support, and so on. Therefore, we are able to focus on tuning the parameters that directly impact end-user experience.
James Martin wrote in his book Systems Engineering Guidebook — a Process For Developing Systems And Products, “There is no perfect system, and probably never will be”. The same also applies to tuning Druid. There is never a perfect formula that works for every use case. The only correct answer that applies to Druid configuration questions is likely — “It depends”.
Like mentioned above, the example configuration for Druid can be a good starting point. But it may not work for your actual use case. Realizing that, we adopted the metrics-driven approach to tune Druid. Internally, we identified a few key metrics that can affect product performance. Druid already emits many metrics out of the box. Alongside those existing metrics, we define our own product metrics (e.g., Events Explorer query success rate, query latency) and infrastructure resource metrics (e.g., pod health, node usage, storage usage) that we keep track of. Objectives are created for the metrics, and all of the work we do to make Druid work better is to achieve those objectives.
It means that the work on Druid is continuous and adaptive based on what our product needs are. It saves our engineering time and efforts, and engineers can better decide on important things to work on.
A powerful attribute of Druid’s query is the query ID. It is a UUID that is provided as queryId in the queryContext for each query. By default, it is a UUID autogenerated by Druid. Application built on top of Druid can provide a custom queryId to when it sends the query to Druid. The queryId is available throughout the lifecycle of the query, e.g. HTTP request, Druid metrics. The queryId can also be included in the product telemetry (e.g. Statsig events). Thus, the end user query experience can be fetched using the queryId . It becomes handy in troubleshooting and monitoring as well.
There are many more we plan to do with Druid at Statsig. Just to name a few,
Our journey with Druid has just begun. The Statsig team is super excited about creating new features powered by Druid. We will share more stories with Druid as we progress along. If you are interested, we welcome you to join us along this journey, either as a user or a team member!
If you would like to chat more, you can find us on Slack.
Explore Statsig’s smart feature gates with built-in A/B tests, or create an account instantly and start optimizing your web and mobile applications. You can also schedule a live demo or chat with us to design a custom package for your business.
This summer I had the pleasure of joining Statsig as their first ever product design intern. This was my first college internship, and I was so excited to get some design experience. I had just finished my freshman year in college and was still working on...
The 95% confidence interval currently dominates online and scientific experimentation; it always has. Yet it’s validity and usefulness is often questioned. It’s called too conservative by some , and too permissive by others. It’s deemed arbitrary...
Statsig’s Journey with Druid This is the text version of the story that we shared at Druid Summit Seattle 2022. Every feature we build at Statsig serves a common goal — to help you better know about your product, and empower you to make good decisions for...
💡 How to decide between leaning on data vs. research when diagnosing and solving product problems Four heuristics I’ve found helpful when deciding between data vs. research to diagnose + solve a problem. Earth image credit of Moncast Drawing. As a PM, data...
Have you ever sent an email to the wrong person? Well I have. At work. From a generic support email address. To a group of our top customers. Facepalm. In March of 2018, I was working on the games team at Facebook. You may remember that month as a tumultuous...
Run experiments with more speed and accuracy We’re pleased to announce the rollout of CUPED for all our customers. Statsig will now automatically use CUPED to reduce variance and bias on experiments’ key metrics. This gives you access to a powerful experiment...