This is the text version of the story that we shared at Druid Summit, Seattle 2022.
Every feature we build at Statsig serves a common goal — to help you better know about your product, and empower you to make good decisions for your product. Using Statsig, product teams should feel comfortable about understanding their product performance and gaining insights through data. This is the power of Product Observability. You may be able to find numerous definitions of the term on Google. In our word, it is a proactive, continuous and adaptive approach to measure the impact of what you build, and to use data to learn what to build next. Experiments, Metrics, Ultrasound, etc., are built to fulfill the observability need.
There is also a need to work with live data as soon as it lands. That is why, in June, we launched Events Explorer, a realtime observability tool to help with understanding product data with no delay. Following Statsig’s speed of product launches, it took our team less than three months to turn the idea into a publicly-available and production-ready product. To make it happen, we stood on the shoulders of giants —by leveraging Apache Druid as our realtime data engine. In this blog, we will share our journey with Druid in the past few months.
Druid is an open source distributed data store managed by the Apache Foundation. It is mostly used in realtime data application, for example, fraud detection, ads analysis and recommendation, etc., where the applications require high-volume realtime data ingestion and low latency data query capabilities. Our need of realtime events and metrics data analysis is also a great fit.
There are a lot of online resources that discuss about Druid in detail. The Druid website is a good starting point. For context, let’s briefly take look at how Druid works.
Druid Architecture
At a high level, Druid does three main tasks: Ingestion, Storage and Query. Druid embraces a distributed architecture that can handle scalability and tolerate failure well. Using the diagram above, let’s talk through what each component does:
In Master servers,
Coordinators help manage where data (aka Segments) lives and its availability.
Overlords oversee the assignment of ingestion jobs.
Metadata about the system is stored in an external metadata storage to keep the current state of the configuration.
In Query servers,
Routers route the API requests to the targeted components.
Brokers handle the query requests.
In Data servers,
MiddleManagers are the components that ingest and index data.
Historicals are the components that store data.
Besides being stored in Historicals for better query performance, data is also stored in a deep storage (e.g., AWS S3, Azure Blob Storage) for data durability.
Zookeeper serves to keep the realtime state of Druid that involves leader election, data management, task management, and so on.
Our team spent some time doing research and comparing similar products in the market. Besides its core functionalities, there are three main reasons why we ended up choosing Druid.
Many early users of Druid may be surprised at this reason. It can be a large amount of work to onboard and start using Druid, due to the number of moving parts and configuration options. Unfortunately, the complexity of Druid still exists today. But, on the bright side, the Druid community has done much work to improve the onboarding experience. Our first working cluster for testing and validation was spun up in a few hours by referring to the quick start guide and example configurations.
In addition, the community support is fantastic, on both Slack and the forum. Questions about Druid are often answered by an expert in the community within a short amount of time. The strong community support has saved us a lot of effort as a new user. With the help of the community and Druid’s thorough documentation, our onboarding experience was much smoother than what we originally expected.
On the opposite side of the complexity of a distributed system, flexibility is an advantage of it. Druid is no exception. Because of the way it is designed, each component can be managed respectively depending on its requirement and functionality. It also means that upgrades to the system are no longer intimidating compared to a monolith system. This is super helpful for us during daily operations, as the Infra team can make changes to Druid while not slowing down the product development progress on top of it.
As a product analytics company, we cannot allow data flow to slow down. Therefore, a high-performance data system is crucial to us. Both the ingestion pipeline and the query experience are performant and meet our needs. Once the ingestion is configured, there is no delay to when you see your data in the datastore and ready for query. There are benchmark results that can attest to this as well.
Although we have only been using Druid for several months, there are some learning that we think is valuable to share with the community.
At Statsig, the majority of our workloads are running inside Kubernetes across many regions. To help manage all those workloads, we adopted the GitOps approach in our daily operations. In short, GitOps leverages the DevOps practices (version control, collaboration, CI/CD, etc.) for infrastructure automation.
We use the same approach to manage Druid in our clusters and across environments. It facilitates collaboration by offering better clarity and transparency on what changes are made to the system. Rollbacks of the changes are possible when things go unexpectedly.
On top of GitOps, we also use the community-supported Druid Operator to help us manage the lifecycle and Kubernetes resources in a controlled way. It offers features, such as, rolling deploy that contributes to the uptime of Druid during upgrades, autoscaling support, and so on. Therefore, we are able to focus on tuning the parameters that directly impact end-user experience.
James Martin wrote in his book Systems Engineering Guidebook — a Process For Developing Systems And Products, “There is no perfect system, and probably never will be”. The same also applies to tuning Druid. There is never a perfect formula that works for every use case. The only correct answer that applies to Druid configuration questions is likely — “It depends”.
Like mentioned above, the example configuration for Druid can be a good starting point. But it may not work for your actual use case. Realizing that, we adopted the metrics-driven approach to tune Druid. Internally, we identified a few key metrics that can affect product performance. Druid already emits many metrics out of the box. Alongside those existing metrics, we define our own product metrics (e.g., Events Explorer query success rate, query latency) and infrastructure resource metrics (e.g., pod health, node usage, storage usage) that we keep track of. Objectives are created for the metrics, and all of the work we do to make Druid work better is to achieve those objectives.
It means that the work on Druid is continuous and adaptive based on what our product needs are. It saves our engineering time and efforts, and engineers can better decide on important things to work on.
A powerful attribute of Druid’s query is the query ID. It is a UUID that is provided as queryId in the queryContext for each query. By default, it is a UUID autogenerated by Druid. Application built on top of Druid can provide a custom queryId to when it sends the query to Druid. The queryId is available throughout the lifecycle of the query, e.g. HTTP request, Druid metrics. The queryId can also be included in the product telemetry (e.g. Statsig events). Thus, the end user query experience can be fetched using the queryId . It becomes handy in troubleshooting and monitoring as well.
There are many more we plan to do with Druid at Statsig. Just to name a few,
Enriched data dimensions and more data type ingestion
Bring real time data to more parts of the product
Allow more customization of data in Druid for users
Data tiering, as well as better usage-driven
to help with Druid resource utilization and cost
Our journey with Druid has just begun. The Statsig team is super excited about creating new features powered by Druid. We will share more stories with Druid as we progress along. If you are interested, we welcome you to join us along this journey, either as a user or a team member!
If you would like to chat more, you can find us on Slack.
Standard deviation and variance are essential for understanding data spread, evaluating probabilities, and making informed decisions. Read More ⇾
We’ve expanded our SRM debugging capabilities to allow customers to define custom user dimensions for analysis. Read More ⇾
Detect interaction effects between concurrent A/B tests with Statsig's new feature to ensure accurate experiment results and avoid misleading metric shifts. Read More ⇾
Statsig's biggest year yet: groundbreaking launches, global events, record scaling, and exciting plans for 2025. Explore our 2024 milestones and what’s next! Read More ⇾
A guide to reporting A/B test results: What are common mistakes and how can you make sure to get it right? Read More ⇾
This guide explains why the allocation point may differ from the exposure point, how it happens, and what you to do about it. Read More ⇾