How Datadog Achieved 99% Timeout Reduction with 20x Scalability Boost

Discover the architecture that cut costs by nearly 50% and unlocked massive scalability

Jan 22, 2025

TL;DR

Situation

Datadog's time-series database, designed in 2016, struggled to manage a 30x growth in data volume and rising query complexity, resulting in slower performance and higher maintenance overhead.

Task

Develop a scalable indexing system to efficiently process high-cardinality data while improving query speed and reducing operational costs.

Action

The team implemented an inverted index inspired by search engines, mapping tags to time-series IDs. Using RocksDB for storage, they ensured scalability, reliability, and efficient query filtering.
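To make the idea concrete, here is a minimal sketch of an inverted index that maps each tag to the set of time-series IDs carrying it, then answers a filter query by intersecting those sets. This is an illustration only, not Datadog's implementation: the class name `InvertedIndex`, its methods, and the sample tags are all hypothetical, and a plain in-memory dictionary stands in for the RocksDB storage layer.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: tag -> set of time-series IDs.

    Hypothetical illustration; a real system would persist the posting
    lists in a key-value store such as RocksDB instead of a dict.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def add_series(self, series_id, tags):
        # Register the series under every tag it carries.
        for tag in tags:
            self._postings[tag].add(series_id)

    def query(self, *tags):
        """Return the IDs of series that carry every requested tag."""
        if not tags:
            return set()
        # Start from the smallest posting list so intersections stay cheap.
        posting_lists = sorted(
            (self._postings.get(tag, set()) for tag in tags), key=len
        )
        result = set(posting_lists[0])
        for ids in posting_lists[1:]:
            result &= ids
            if not result:
                break  # No series can match once the intersection is empty.
        return result


index = InvertedIndex()
index.add_series(1, ["env:prod", "service:web", "host:a"])
index.add_series(2, ["env:prod", "service:db", "host:b"])
index.add_series(3, ["env:staging", "service:web", "host:c"])

matches = index.query("env:prod", "service:web")
print(matches)
```

A query touches only the posting lists for the tags it names, so its cost depends on how many series share those tags rather than on the total number of series stored.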

Result

Query timeouts dropped by 99%, the system could support metrics with 20x higher cardinality, and operational costs fell by nearly 50%.

Use Cases

Real-Time Monitoring, Tag-Based Filtering, Dynamic Schema Handling, Query Execution

Tech Stack/Framework

RocksDB, Apache Kafka, SQLite, Time-Series Database, Rust

Explained Further

Understanding the Problem

Datadog’s time-series database faced significant challenges as data volumes grew 30x between 2017 and 2022. More complex user queries and higher data cardinality strained the existing indexing system, introduced in 2016. The original architecture became a bottleneck for query performance and required substantial maintenance.

Metrics Platform Overview
