ML Training Too Slow? Yelp’s 1,400x Speed Boost Fixes That
Discover the data pipeline and GPU optimisations that made it happen
TL;DR
Situation
Yelp's ad revenue depends on predicting which ads users are likely to click, using a "Wide and Deep" neural network model. Training this model on 450 million samples initially took 75 hours per cycle, which was too slow; Yelp wanted to scale to 2 billion samples while bringing training time under an hour per cycle.
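For readers unfamiliar with the architecture, a "wide & deep" model combines a linear (wide) path over sparse, crossed features with a deep network over dense or embedded features. The sketch below shows the general pattern in Keras; the feature sizes and layer widths are illustrative, not Yelp's actual model.

```python
import tensorflow as tf

# Generic wide & deep sketch (illustrative shapes, not Yelp's model):
# a linear "wide" path over sparse cross features joined with a "deep"
# MLP over dense/embedded features, producing one click probability.
wide_in = tf.keras.Input(shape=(1000,), name="wide_features")  # e.g. hashed feature crosses
deep_in = tf.keras.Input(shape=(64,), name="deep_features")    # e.g. dense embeddings

wide = tf.keras.layers.Dense(1, activation=None)(wide_in)      # linear component
deep = tf.keras.layers.Dense(256, activation="relu")(deep_in)
deep = tf.keras.layers.Dense(128, activation="relu")(deep)
deep = tf.keras.layers.Dense(1, activation=None)(deep)

# Sum the two logits and squash to a click probability.
logit = tf.keras.layers.Add()([wide, deep])
output = tf.keras.layers.Activation("sigmoid")(logit)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```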
Task
The goal was to speed up the training process by improving how data is stored and read, and by using multiple GPUs to handle more data at once.
Action
Data Storage: Yelp stores its training data as Parquet files on Amazon S3, a format that works well with its Spark-based data processing. The off-the-shelf reader Petastorm proved too slow for feeding this data into training, so Yelp built its own streaming layer, ArrowStreamServer, which reads and serves the data far more efficiently, cutting the time to process 9 million samples from over 13 minutes to about 19 seconds.
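The article doesn't include ArrowStreamServer's code, but the core idea it describes, reading Parquet with PyArrow and shipping columnar record batches in Arrow's IPC stream format so the trainer never parses individual rows, can be sketched roughly as follows. The function names, paths, and batch size are illustrative assumptions.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Rough sketch of serving training data as Arrow record batches instead of
# decoding Parquet row by row (names and parameters are illustrative, not
# Yelp's actual ArrowStreamServer implementation).

def stream_parquet_as_arrow(parquet_path, sink, batch_size=10_000):
    """Read a Parquet file in column batches and write them to `sink`
    using Arrow's IPC stream format."""
    reader = pq.ParquetFile(parquet_path)
    writer = None
    for batch in reader.iter_batches(batch_size=batch_size):
        if writer is None:
            writer = pa.ipc.new_stream(sink, batch.schema)
        writer.write_batch(batch)
    if writer is not None:
        writer.close()

def read_arrow_stream(source):
    """Consume the stream on the training side; each record batch arrives
    as contiguous columnar buffers, so there is no per-row deserialization."""
    reader = pa.ipc.open_stream(source)
    for batch in reader:
        yield batch  # hand off to the tf.data input pipeline
```

Here `sink` and `source` could be anything file-like, for example a socket between a data-serving process and the trainer, or an in-memory buffer for testing; the columnar batches convert cheaply into tensors on the consuming side.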
Distributed Training: Yelp initially trained across multiple GPUs with TensorFlow's MirroredStrategy, but it scaled poorly as more GPUs were added. Switching to Horovod, a distributed training framework, let them use up to 8 GPUs efficiently and significantly sped up training.
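Horovod's Keras integration is close to a drop-in wrapper around an existing training script. The sketch below shows the general data-parallel pattern the article describes, with a toy model and synthetic data standing in for Yelp's actual pipeline.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Minimal Horovod + Keras data-parallel sketch (synthetic data and a toy
# model stand in for Yelp's pipeline). Launch with e.g.:
#   horovodrun -np 8 python train.py
hvd.init()

# Pin each worker process to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy dataset, sharded so each worker sees a different slice.
features = np.random.rand(100_000, 64).astype("float32")
labels = np.random.randint(0, 2, size=(100_000, 1)).astype("float32")
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shard(hvd.size(), hvd.rank())
           .batch(1024))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the worker count and wrap the optimizer so
# gradients are averaged across GPUs via ring-allreduce.
opt = tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
model.compile(optimizer=hvd.DistributedOptimizer(opt),
              loss="binary_crossentropy")

model.fit(
    dataset,
    epochs=1,
    # Ensure every worker starts from identical initial weights.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```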
Result
Combined, these changes sped up model training by roughly 1,400x. Yelp can now train its ad click-prediction models far faster, handle much more data, and improve its ad services.
Use Cases
Large-Scale ML Training, ML Training Optimisation, Enhancing Data Pipeline Efficiency
Tech Stack/Framework
TensorFlow, Horovod, Keras, PyArrow, Amazon S3, Apache Spark