Data Tinkerer


ML Training Too Slow? Yelp’s 1,400x Speed Boost Fixes That

Discover the data pipeline and GPU optimisations that made it happen

Data Tinkerer
Feb 12, 2025
Photo by appshunter.io on Unsplash

TL;DR


Situation

Yelp's ad revenue depends on predicting which ads users are likely to click, using a "Wide and Deep" neural network model. Initially, training this model on 450 million samples took 75 hours per cycle, far too slow to iterate on. Yelp wanted to scale to 2 billion samples while cutting training time to under an hour per cycle.

Task

The goal was to speed up the training process by improving how data is stored and read, and by using multiple GPUs to handle more data at once.

Action

  • Data Storage: Yelp stored the training data in Parquet format on Amazon S3, which integrates well with their data processing system, Apache Spark. They found that Petastorm, an existing data-loading library, was too slow for their needs, so they built their own streaming service, ArrowStreamServer. It reads and sends data more efficiently, cutting the time to process 9 million samples from over 13 minutes to about 19 seconds.

  • Distributed Training: Yelp initially used TensorFlow's MirroredStrategy to train the model on multiple GPUs but found it scaled poorly as GPUs were added. They switched to Horovod, which let them efficiently use up to 8 GPUs at once and significantly sped up training.
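The data-storage idea above can be sketched with PyArrow: read Parquet data and re-encode it as an Arrow IPC stream of record batches, so the training side consumes columnar batches with minimal deserialisation. This is a minimal illustration of the ArrowStreamServer concept, not Yelp's actual implementation; the sample schema and the in-memory buffer standing in for S3 are assumptions made for the example.

```python
import io

import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq


def make_sample_parquet() -> bytes:
    """Create a tiny Parquet payload (stand-in for a file on S3)."""
    table = pa.table({
        "user_id": pa.array([1, 2, 3, 4], type=pa.int64()),
        "clicked": pa.array([0, 1, 0, 1], type=pa.int8()),
    })
    buf = io.BytesIO()
    pq.write_table(table, buf)
    return buf.getvalue()


def serve_as_arrow_stream(parquet_bytes: bytes, batch_size: int = 2) -> bytes:
    """Server side: re-encode Parquet as an Arrow IPC stream of record batches."""
    table = pq.read_table(io.BytesIO(parquet_bytes))
    sink = io.BytesIO()
    with ipc.new_stream(sink, table.schema) as writer:
        for batch in table.to_batches(max_chunksize=batch_size):
            writer.write_batch(batch)  # columnar batches, no row-by-row decoding
    return sink.getvalue()


def read_arrow_stream(payload: bytes) -> pa.Table:
    """Client side: reassemble the streamed batches into a table."""
    reader = ipc.open_stream(io.BytesIO(payload))
    return reader.read_all()
```

Because Arrow's IPC format matches its in-memory layout, the client can hand batches to the training loop without a costly deserialisation step, which is where Petastorm was reportedly losing time.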
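The distributed-training step rests on one core operation: each worker computes gradients on its own data shard, an allreduce averages them, and every worker applies the identical update. The toy sketch below simulates that pattern in plain Python on a 1-D linear model; it is a conceptual stand-in, not Horovod's API, and the model, learning rate, and shard layout are illustrative assumptions.

```python
# Toy sketch of data-parallel training with gradient allreduce, the pattern
# Horovod implements efficiently across GPUs. Plain Python functions stand
# in for GPU workers; the model is a 1-D linear fit y = w * x.

def local_gradient(w: float, shard: list[tuple[float, float]]) -> float:
    """Mean squared-error gradient dL/dw on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values: list[float]) -> float:
    """The allreduce step: every worker ends up with the mean gradient."""
    return sum(values) / len(values)


def train_step(w: float, shards: list[list[tuple[float, float]]],
               lr: float = 0.01) -> float:
    grads = [local_gradient(w, s) for s in shards]  # computed in parallel on GPUs
    g = allreduce_mean(grads)                       # averaged across workers
    return w - lr * g                               # identical update everywhere


def train(shards, steps: int = 500, w: float = 0.0) -> float:
    for _ in range(steps):
        w = train_step(w, shards)
    return w
```

Because the gradient traffic is a fixed-size allreduce rather than all workers funnelling gradients through one parameter host, this pattern keeps scaling as GPUs are added, which is the property Yelp gained by moving from MirroredStrategy to Horovod.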

Result

By combining these changes, Yelp achieved an overall speedup of roughly 1,400x in model training, fast enough to handle the larger dataset and iterate on their ad prediction models far more quickly.

Use Cases

Large-Scale ML Training, ML Training Optimisation, Enhancing Data Pipeline Efficiency

Tech Stack/Framework

TensorFlow, Horovod, Keras, PyArrow, Amazon S3, Apache Spark


Explained Further


The Challenge

This post is for paid subscribers
