How HubSpot Optimized Logging to Save Millions
By refining log storage and retention, HubSpot cut monthly JSON log storage costs by 55.7% and sped up some log queries by as much as 50x
TL;DR
Situation
HubSpot's backend performance team identified that Amazon S3 storage costs accounted for approximately 45% to 50% of daily expenses, with the 'hubspot-live-logs-prod' bucket alone responsible for 20% of these costs.
Task
The team set out to reduce storage costs by fixing inefficiencies in its logging system, in particular the large volumes of raw JSON logs that were not being compacted efficiently.
Action
Log Retention Review: The team discovered that raw JSON logs were retained for 730 days, while compressed ORC logs were kept for only 460 days. Aligning both formats to a 460-day retention period removed 270 days of raw JSON storage that served no purpose.
Improved Compression: By improving their Spark compaction process, they converted a larger share of raw JSON logs to the more storage-efficient ORC format. The resulting ORC logs were about 5% of the size of the original JSON logs.
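The retention alignment described above could be expressed as S3 lifecycle expiration rules. The source does not say which mechanism HubSpot actually used, so the following is only a minimal sketch: the key prefixes `raw-json/` and `orc/` and the rule IDs are hypothetical, and only the 460-day figure comes from the case study.

```python
import json

# Hypothetical prefixes: the source does not describe the bucket layout.
RAW_JSON_PREFIX = "raw-json/"
ORC_PREFIX = "orc/"

# Aligned retention period reported in the case study.
RETENTION_DAYS = 460


def expiration_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Build one S3 lifecycle rule that expires objects under `prefix` after `days` days."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Both formats get the same retention, so neither outlives the other.
lifecycle_configuration = {
    "Rules": [
        expiration_rule("expire-raw-json", RAW_JSON_PREFIX, RETENTION_DAYS),
        expiration_rule("expire-orc", ORC_PREFIX, RETENTION_DAYS),
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

A dictionary of this shape is what boto3's `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)` accepts; the sketch only builds and prints it, so it can be reviewed before being applied to any bucket.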
Result
These measures led to a 55.7% reduction in monthly JSON log storage costs, translating to annual savings in the seven-figure range. Additionally, engineers experienced faster log query times, with some reporting reductions from 30 minutes to just 36 seconds.
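The figures quoted above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a reconstruction of HubSpot's own accounting: the retention estimate assumes a constant daily log volume, and the share of JSON actually converted to ORC is not given in the source, so `compacted_fraction` is a hypothetical input. The reported 55.7% saving is not derived here.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' duration is."""
    return before_seconds / after_seconds


def retained_volume_cut(old_days: int, new_days: int) -> float:
    """Fraction of steady-state retained volume removed by shortening retention.

    Assumes the same amount of log data is written every day.
    """
    return (old_days - new_days) / old_days


def relative_footprint(compacted_fraction: float, orc_to_json_ratio: float = 0.05) -> float:
    """Storage size relative to keeping everything as raw JSON.

    `compacted_fraction` is the (hypothetical) share of JSON volume converted to ORC;
    `orc_to_json_ratio` is the "about 5%" size ratio from the case study.
    """
    return (1.0 - compacted_fraction) + compacted_fraction * orc_to_json_ratio


# 30 minutes down to 36 seconds.
query_speedup = speedup(30 * 60, 36)
print(f"query speedup: {query_speedup:.0f}x")

# Raw JSON retention cut from 730 days to 460 days.
json_cut = retained_volume_cut(730, 460)
print(f"retained raw JSON volume removed: {json_cut:.1%}")

# Footprint if, hypothetically, half or all of the JSON volume were compacted.
for fraction in (0.5, 1.0):
    print(f"compacted {fraction:.0%}: {relative_footprint(fraction):.1%} of the all-JSON size")
```

The first result matches the 50x figure in the subtitle; the others show only how the retention and compression levers scale under the stated assumptions.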
Use Cases
Cost monitoring, Log retention, Log volume reduction
Tech Stack/Framework
Amazon Athena, Amazon S3, Apache Spark, Apache Mesos, Redash