Scaling Apache Flink: How Reddit Cut Memory Usage by 60%
Optimizing real-time ad validation with field filtering, tiered storage, and infrastructure enhancements.
Situation
Reddit's advertising platform processes thousands of ad engagement events per second. Each event must be validated and enriched in real time to keep reporting accurate and prevent budget overdelivery.
Task
Develop a scalable, real-time ad event validation system capable of efficiently handling high event volumes while maintaining performance and reliability.
Action
The engineering team built the Ad Events Validator (AEV) on Apache Flink to correlate ad server events with user engagement events. To address large state sizes and heavy resource demands, they implemented:
Field Filtering: The team analyzed which fields downstream consumers actually read and built an allowlist that cut event payload size by 90%, which in turn reduced CPU usage by 25% and memory usage by 60%.
Tiered State Storage: The team integrated Apache Cassandra as external state storage, shrinking in-memory state and making checkpointing and recovery more efficient.
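To make the field-filtering idea concrete, here is a minimal sketch in plain Python rather than Flink code. The field names in ALLOWED_FIELDS and the shape of the sample event are assumptions made for illustration; the article does not list the fields Reddit actually keeps.

```python
import json

# Hypothetical allowlist: the real field names are not given in the article.
ALLOWED_FIELDS = frozenset({"event_id", "ad_id", "event_type", "timestamp_ms"})


def filter_fields(event, allowed=ALLOWED_FIELDS):
    """Return a copy of the event with only the allowlisted top-level fields."""
    return {key: value for key, value in event.items() if key in allowed}


# Made-up sample event carrying fields that no downstream consumer reads.
raw_event = {
    "event_id": "e-123",
    "ad_id": "ad-9",
    "event_type": "click",
    "timestamp_ms": 1_700_000_000_000,
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "debug_context": {"trace": ["a", "b", "c"], "notes": "x" * 200},
}

slim_event = filter_fields(raw_event)

# Compare serialized sizes to see how much smaller the stored payload becomes.
raw_size = len(json.dumps(raw_event))
slim_size = len(json.dumps(slim_event))
print(f"raw={raw_size} bytes, filtered={slim_size} bytes")
```

Applying a filter like this before events enter keyed state means every buffered event carries only the fields that are read later, which is where the memory savings would come from.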
Result
Together, these changes made AEV more scalable and cost-efficient, lowering its resource footprint while improving checkpointing and recovery.
Use Cases
Real-Time Event Validation, Data Enrichment, Resource Optimization
Tech Stack/Framework
Apache Flink, Apache Kafka, Apache Cassandra
Explained Further
Background
Reddit processes thousands of ad engagement events per second. These events require validation and enrichment before being sent to downstream systems. Key components of this validation process include applying a standardized look-back window and filtering out suspected invalid traffic.
In addition to a batch validation pipeline, a near real-time pipeline improves budget spend accuracy and provides advertisers with real-time insights into campaign performance. This real-time component, known as the Ad Events Validator (AEV), is built using Apache Flink. AEV matches ad server events with engagement events and writes the validated results to a separate Kafka topic for downstream consumption.
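The matching step can be sketched as follows. This is a simplified, single-process illustration in plain Python, not the AEV implementation: the join key (impression_id), the event fields, and the 30-minute look-back value are all assumptions, and a real Flink job would hold the buffered events in keyed state and clean them up with timers.

```python
from dataclasses import dataclass

# Hypothetical look-back window; the article does not state the real value.
LOOKBACK_MS = 30 * 60 * 1000


@dataclass
class AdServerEvent:
    impression_id: str  # assumed join key
    timestamp_ms: int


@dataclass
class EngagementEvent:
    impression_id: str  # assumed join key
    timestamp_ms: int
    kind: str  # e.g. "click" or "view"


class EventValidator:
    """Buffers ad server events and validates engagement events against them."""

    def __init__(self, lookback_ms=LOOKBACK_MS):
        self.lookback_ms = lookback_ms
        self._served = {}  # impression_id -> AdServerEvent

    def on_ad_served(self, event):
        self._served[event.impression_id] = event

    def on_engagement(self, event):
        """Return a validated record, or None if the engagement is rejected."""
        served = self._served.get(event.impression_id)
        if served is None:
            return None  # no matching ad server event
        age_ms = event.timestamp_ms - served.timestamp_ms
        if age_ms < 0 or age_ms > self.lookback_ms:
            return None  # outside the look-back window
        return {
            "impression_id": event.impression_id,
            "kind": event.kind,
            "served_at_ms": served.timestamp_ms,
            "engaged_at_ms": event.timestamp_ms,
        }

    def expire(self, now_ms):
        """Drop buffered ad server events older than the look-back window."""
        cutoff = now_ms - self.lookback_ms
        self._served = {
            key: ev for key, ev in self._served.items() if ev.timestamp_ms >= cutoff
        }


validator = EventValidator(lookback_ms=1_000)
validator.on_ad_served(AdServerEvent("imp-1", timestamp_ms=0))
print(validator.on_engagement(EngagementEvent("imp-1", 500, "click")))
```

In the pipeline described above, each record returned by on_engagement would be written to the output Kafka topic. The buffer of served events is the state that grows with traffic, which is why payload size and state storage matter so much for this job.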
Building and maintaining AEV, though, presented several challenges for the Reddit team.