ML Training Too Slow? Yelp’s 1,400x Speed Boost Fixes That
Discover the data pipeline and GPU optimisations that made it happen
TL;DR
Situation
Yelp's ad revenue depends on predicting which ads users are likely to click, using a "Wide and Deep" neural network model. Training this model on 450 million samples initially took 75 hours per cycle, which was too slow; Yelp wanted to scale to 2 billion samples while bringing training time under an hour per cycle.
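For readers unfamiliar with the architecture, a "wide & deep" model combines a linear (wide) path over sparse, crossed features with a deep network over dense or embedded features. The sketch below shows the general pattern in Keras; the feature sizes and layer widths are illustrative, not Yelp's actual model.

```python
import tensorflow as tf

# Generic wide & deep sketch (illustrative shapes, not Yelp's model):
# a linear "wide" path over sparse cross features joined with a "deep"
# MLP over dense/embedded features, producing one click probability.
wide_in = tf.keras.Input(shape=(1000,), name="wide_features")  # e.g. hashed feature crosses
deep_in = tf.keras.Input(shape=(64,), name="deep_features")    # e.g. dense embeddings

wide = tf.keras.layers.Dense(1, activation=None)(wide_in)      # linear component
deep = tf.keras.layers.Dense(256, activation="relu")(deep_in)
deep = tf.keras.layers.Dense(128, activation="relu")(deep)
deep = tf.keras.layers.Dense(1, activation=None)(deep)

# Sum the two logits and squash to a click probability.
logit = tf.keras.layers.Add()([wide, deep])
output = tf.keras.layers.Activation("sigmoid")(logit)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```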
Task
The goal was to speed up the training process by improving how data is stored and read, and by using multiple GPUs to handle more data at once.
Action
Data Storage: Yelp stores its training data as Parquet files on Amazon S3, a format that works well with its Spark-based data processing. The off-the-shelf reader Petastorm proved too slow for feeding this data into training, so Yelp built its own streaming layer, ArrowStreamServer, which reads and serves the data far more efficiently, cutting the time to process 9 million samples from over 13 minutes to about 19 seconds.
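The article doesn't include ArrowStreamServer's code, but the core idea it describes, reading Parquet with PyArrow and shipping columnar record batches in Arrow's IPC stream format so the trainer never parses individual rows, can be sketched roughly as follows. The function names, paths, and batch size are illustrative assumptions.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Rough sketch of serving training data as Arrow record batches instead of
# decoding Parquet row by row (names and parameters are illustrative, not
# Yelp's actual ArrowStreamServer implementation).

def stream_parquet_as_arrow(parquet_path, sink, batch_size=10_000):
    """Read a Parquet file in column batches and write them to `sink`
    using Arrow's IPC stream format."""
    reader = pq.ParquetFile(parquet_path)
    writer = None
    for batch in reader.iter_batches(batch_size=batch_size):
        if writer is None:
            writer = pa.ipc.new_stream(sink, batch.schema)
        writer.write_batch(batch)
    if writer is not None:
        writer.close()

def read_arrow_stream(source):
    """Consume the stream on the training side; each record batch arrives
    as contiguous columnar buffers, so there is no per-row deserialization."""
    reader = pa.ipc.open_stream(source)
    for batch in reader:
        yield batch  # hand off to the tf.data input pipeline
```

Here `sink` and `source` could be anything file-like, for example a socket between a data-serving process and the trainer, or an in-memory buffer for testing; the columnar batches convert cheaply into tensors on the consuming side.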
Distributed Training: Yelp initially trained across multiple GPUs with TensorFlow's MirroredStrategy, but it scaled poorly as more GPUs were added. Switching to Horovod, a distributed training framework, let them use up to 8 GPUs efficiently and significantly sped up training.
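Horovod's Keras integration is close to a drop-in wrapper around an existing training script. The sketch below shows the general data-parallel pattern the article describes, with a toy model and synthetic data standing in for Yelp's actual pipeline.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Minimal Horovod + Keras data-parallel sketch (synthetic data and a toy
# model stand in for Yelp's pipeline). Launch with e.g.:
#   horovodrun -np 8 python train.py
hvd.init()

# Pin each worker process to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy dataset, sharded so each worker sees a different slice.
features = np.random.rand(100_000, 64).astype("float32")
labels = np.random.randint(0, 2, size=(100_000, 1)).astype("float32")
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shard(hvd.size(), hvd.rank())
           .batch(1024))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the worker count and wrap the optimizer so
# gradients are averaged across GPUs via ring-allreduce.
opt = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
model.compile(optimizer=hvd.DistributedOptimizer(opt),
              loss="binary_crossentropy")

model.fit(
    dataset,
    epochs=1,
    # Ensure every worker starts from identical initial weights.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```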
Result
Combined, these changes sped up model training by roughly 1,400x. Yelp can now train its ad click-prediction models far faster, handle much more data, and improve its ad services.
Use Cases
Large-Scale ML Training, ML Training Optimisation, Enhancing Data Pipeline Efficiency
Tech Stack/Framework
TensorFlow, Horovod, Keras, PyArrow, Amazon S3, Apache Spark