Data Tinkerer

How Datadog Achieved 99% Timeout Reduction with 20x Scalability Boost
Data Engineering

Discover the architecture that cut costs by 50% and unlocked massive scalability

Data Tinkerer
Jan 22, 2025
∙ Paid

(Source: Datadog)

TL;DR


Situation

The indexing system behind Datadog's time-series database, introduced in 2016, struggled to keep up with a 30x growth in data volume and rising query complexity, resulting in slower queries and higher maintenance overhead.

Task

Develop a scalable indexing system to efficiently process high-cardinality data while improving query speed and reducing operational costs.

Action

The team implemented an inverted index inspired by search engines, mapping each tag to the IDs of the time series that carry it. The index is stored in RocksDB, chosen for scalability, reliability, and efficient query filtering.
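To make the idea of an inverted index concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not Datadog's implementation: the class name `InvertedTagIndex`, its methods, and the sample tags are all hypothetical, and a plain dictionary stands in for the RocksDB storage layer. Each tag maps to the set of time-series IDs that carry it, and a query is answered by intersecting those sets.

```python
from collections import defaultdict


class InvertedTagIndex:
    """Toy inverted index: tag string -> set of time-series IDs (hypothetical sketch)."""

    def __init__(self):
        # In a real system this mapping would live in a persistent store;
        # here it is just a dict held in memory.
        self._postings = defaultdict(set)

    def add_series(self, series_id, tags):
        """Register a time series under every tag it carries."""
        for tag in tags:
            self._postings[tag].add(series_id)

    def query(self, required_tags):
        """Return the IDs of series that carry all of the required tags."""
        # .get avoids creating empty entries for tags that were never indexed.
        posting_sets = [self._postings.get(tag, set()) for tag in required_tags]
        if not posting_sets:
            return set()
        # Intersect starting from the smallest set to keep intermediate results small.
        posting_sets.sort(key=len)
        result = set(posting_sets[0])
        for ids in posting_sets[1:]:
            result &= ids
        return result


index = InvertedTagIndex()
index.add_series(1, ["env:prod", "service:web", "host:a"])
index.add_series(2, ["env:prod", "service:db", "host:b"])
index.add_series(3, ["env:staging", "service:web", "host:c"])

matches = index.query(["env:prod", "service:web"])
print(sorted(matches))  # prints [1]
```

The cost of a lookup in this sketch depends on the sizes of the sets for the queried tags rather than on the total number of series, which is the property that makes this structure attractive for tag-based filtering.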

Result

Query timeouts dropped by 99%, the system now supports metrics with 20x higher cardinality, and operational costs fell by nearly 50%.

Use Cases

Real-Time Monitoring, Tag-Based Filtering, Dynamic Schema Handling, Query Execution

Tech Stack/Framework

RocksDB, Apache Kafka, SQLite, Time-Series Database, Rust


Explained Further


Understanding the Problem

Datadog’s time-series database faced significant challenges as data volumes grew 30x between 2017 and 2022. More complex user queries and higher data cardinality strained the existing indexing system, introduced in 2016, which became a bottleneck for query performance and required substantial maintenance.


Metrics Platform Overview
