How Datadog Achieved 99% Timeout Reduction with 20x Scalability Boost
Discover the architecture that cut costs by 50% and unlocked massive scalability
TL;DR
Situation
Datadog's time-series database, designed in 2016, struggled to manage a 30x growth in data volume and rising query complexity, resulting in slower performance and higher maintenance overhead.
Task
Develop a scalable indexing system to efficiently process high-cardinality data while improving query speed and reducing operational costs.
Action
The team implemented an inverted index inspired by search engines, mapping tags to time-series IDs. Using RocksDB for storage, they ensured scalability, reliability, and efficient query filtering.
Result
Query performance improved by 99%, enabling support for 20x higher cardinality metrics, reducing query timeouts, and cutting operational costs by nearly 50%.
Use Cases
Real-Time Monitoring, Tag-Based Filtering, Dynamic Schema Handling, Query Execution
Tech Stack/Framework
RocksDB, Apache Kafka, SQLite, Time-Series Database, Rust
Explained Further
Understanding the Problem
Datadog’s timeseries database faced significant challenges as data volumes grew 30x between 2017 and 2022. The increasing complexity of user queries and higher data cardinality strained the existing indexing system, introduced in 2016. The original architecture became a bottleneck for query performance and required substantial maintenance.