For us, as a startup, shorter training times give us the ability to experiment more with new datasets, new architectures and new approaches. This allows us to iterate faster, and bring products to market quicker. It’s all part of being a startup!
Convolutional neural networks can take a long time to train. We have found that even transfer learning on a pre-trained model such as VGG16 or ResNet can take over an hour per epoch when working with a large dataset and a pipeline that includes aggressive image augmentation.
So we set out to explore some of the low-hanging fruit that anyone can exploit for quick wins in terms of per-epoch training times in Keras.
Check Your Hardware Config!
This may sound obvious, but incorrect GPU configuration is a frequent cause of poor training performance. Thankfully it’s easy to detect. Start training a Keras model, and check the output of nvidia-smi in the terminal. You should see something like this:
This essentially tells you that a python process is occupying the majority of the memory (the other bits are involved in Ubuntu’s Gnome Shell and X), and the GPU is being utilised. Looks good!
If you cannot run nvidia-smi, or you see output that differs from the above, then incorrectly installed GPU software is the usual culprit. On Ubuntu, the setup can be non-trivial: it involves installing the Nvidia drivers using apt and then installing tensorflow-gpu as the backend for Keras. Check out the guides on www.python36.com to ensure that your GPU is correctly installed and configured.
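The nvidia-smi check can also be scripted, which is handy for logging utilisation during a long run. The following is a minimal sketch, not a definitive tool: the function names are our own, the choice of fields is an assumption about what is worth watching, and the parser assumes every field comes back as a plain number.

```python
import subprocess

# Standard nvidia-smi query options; the selection of fields is our choice.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Return per-GPU stats, or None if nvidia-smi cannot be run or parsed."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        return parse_gpu_stats(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    print(read_gpu_stats())
```

Run it while a model is training. A result of None points to a driver or installation problem, while utilisation that stays near zero suggests the GPU is not actually being used.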
While the GPU is doing all of the weight computation during training, the CPU is still doing a lot of heavy lifting prior to offloading the data to the GPU memory. If you are using image augmentation (like we do, substantially) through ImageDataGenerator in Keras, CPU bottlenecks can be significant. Basically, the CPU is loading data into RAM, and then applying a number of transformations prior to offloading it to the GPU. This is a lot of work! This is what we saw when monitoring CPU usage, and it led us to understand the CPU bottleneck in our training:
As you can see, only one of the sixteen threads available on our Ryzen 7 is actually doing any work, which leaves our Titan V largely idle during training. On a sample CNN, this led to a per-epoch time of around three minutes for transfer learning on VGG16.
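One rough way to confirm that the data generator, rather than the GPU, is the slow part is to time it on its own with no model attached. This is a minimal sketch using only the standard library. Both function names are hypothetical, and the sleeping generator is merely a stand-in for a real ImageDataGenerator iterator.

```python
import time
from itertools import islice


def seconds_per_batch(batches, n_batches=20):
    """Average wall-clock seconds needed to produce one batch."""
    start = time.perf_counter()
    produced = sum(1 for _ in islice(batches, n_batches))
    elapsed = time.perf_counter() - start
    if produced == 0:
        raise ValueError("the iterator produced no batches")
    return elapsed / produced


def fake_augmenting_generator(delay=0.01):
    """Hypothetical stand-in for a Keras generator: sleeps to mimic CPU work."""
    while True:
        time.sleep(delay)
        yield None


if __name__ == "__main__":
    per_batch = seconds_per_batch(fake_augmenting_generator(), n_batches=10)
    print(f"{per_batch:.4f} s per batch")
```

If the seconds per batch multiplied by the number of batches per epoch comes close to your observed epoch time, the CPU-side pipeline is probably the limiting factor.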
Thankfully the fix in Keras is rather simple. Add the following to your .fit_generator call, replacing workers=16 with the number of available threads on your CPU:
history = model.fit_generator(
This parallelization of the ImageDataGenerator instances cut the per-epoch training time from around three minutes to 21 seconds. This is significant, as we can churn through more training and keep our GPU utilised, enabling us to experiment more and create better models.
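Rather than hard-coding the thread count, the worker setting can be derived from the machine the script runs on. The sketch below is one way to do that, not our exact training code: generator_fit_kwargs and reserved_threads are hypothetical names, and train_generator and epochs=10 in the usage comment are placeholders.

```python
import os


def generator_fit_kwargs(reserved_threads=0, max_queue_size=10):
    """Build worker-related keyword arguments for model.fit_generator.

    reserved_threads (hypothetical) leaves some threads free for other work.
    max_queue_size=10 matches the Keras 2.x default for fit_generator.
    """
    threads = os.cpu_count() or 1  # os.cpu_count() can return None
    workers = max(1, threads - reserved_threads)
    return {"workers": workers, "max_queue_size": max_queue_size}


if __name__ == "__main__":
    print(generator_fit_kwargs())
    # Usage with placeholder names:
    # history = model.fit_generator(
    #     train_generator, epochs=10, **generator_fit_kwargs()
    # )
```

More recent TensorFlow releases deprecate fit_generator in favour of fit, so check which of these arguments your installed version accepts before relying on them.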
Finally, IO Bottlenecks
We keep our active training data on a Samsung 960 Pro 1TB NVMe SSD. This is the fastest drive we can get within our budget. If you are running your datasets from a SATA SSD or, even worse, a spinning drive, then this is likely to be a real bottleneck.
If you are using flow_from_directory to load your images, then Keras needs to load in batches of images from the drive before the CPU / GPU can even start working on them.
Basically we concur with Jeremy Howard from fast.ai on this one. Get the biggest and fastest drive you can afford, preferably an NVMe M.2 SSD if your budget permits, to reduce this bottleneck and move data to RAM as quickly as possible.
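To get a rough sense of whether the drive is holding things back, you can time how fast raw files come off it. The following is a minimal sketch under our own assumptions: read_throughput is a hypothetical helper, the 500-file cap is arbitrary, and it measures plain reads only, with no image decoding.

```python
import os
import time


def read_throughput(directory, max_files=500):
    """Read up to max_files files and return (files, bytes, MB per second)."""
    total_bytes = 0
    files_read = 0
    start = time.perf_counter()
    for root, _dirs, names in os.walk(directory):
        for name in names:
            if files_read >= max_files:
                break
            with open(os.path.join(root, name), "rb") as handle:
                total_bytes += len(handle.read())
            files_read += 1
        if files_read >= max_files:
            break
    elapsed = time.perf_counter() - start
    rate = total_bytes / 1e6 / elapsed if elapsed > 0 else 0.0
    return files_read, total_bytes, rate


if __name__ == "__main__":
    # "." is a placeholder; point this at your training image directory.
    print(read_throughput("."))
```

Bear in mind that the operating system caches recently read files in RAM, so a second run over the same directory will usually report a much higher figure than the drive can actually sustain.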
Simple checks and optimisations such as confirming correct GPU driver installation, using a fast SSD to store training data, and optimising your model to take advantage of multicore CPUs can lead to huge reductions in training times for convolutional neural networks. If you are noticing performance issues, give some of the above tips a try!