Deep Learning GPU Benchmarks

GPU training/inference speeds using PyTorch/TensorFlow for computer vision (CV), NLP, text-to-speech (TTS), etc.

PyTorch GPU Benchmarks

[Interactive chart. Filters: visualization, metric, precision, number of GPUs, model.]

TensorFlow Training GPU Benchmarks

[Interactive chart. Filters: visualization, metric, precision, number of GPUs, model.]

YOLOv5 Inference GPU Benchmarks

[Interactive chart. Filters: visualization, metric, precision, methods, model.]

GPU Benchmark Methodology

To measure the relative effectiveness of GPUs for training neural networks, we’ve chosen training throughput as the yardstick. Training throughput is the number of samples (e.g. tokens or images) processed per second by the GPU.
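As a rough illustration of the definition above, the following sketch times a fixed number of training steps and divides the samples processed by the elapsed time. It is not the benchmark code used for these results: `measure_throughput` and `fake_step` are hypothetical names, and the sleep stands in for a real forward/backward pass.

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=50, warmup_steps=5):
    """Return samples processed per second for a callable that runs one batch."""
    # Warm-up steps are run but not timed, so one-off startup costs
    # (memory allocation, kernel selection) do not skew the result.
    for _ in range(warmup_steps):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start

    return batch_size * num_steps / elapsed


def fake_step():
    # Hypothetical stand-in for one training step on one batch.
    time.sleep(0.002)


throughput = measure_throughput(fake_step, batch_size=64, num_steps=20, warmup_steps=2)
print(f"{throughput:.0f} samples/sec")
```

On a real GPU the work is launched asynchronously, so the device should be synchronized before reading the clock (in PyTorch, with `torch.cuda.synchronize()`); otherwise the timer only captures how long it took to queue the work.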

Using throughput instead of floating-point operations per second (FLOPS) ties GPU performance directly to the task of training neural networks. Training throughput is strongly correlated with time to solution: with higher training throughput, the GPU can run a dataset through the model more quickly and train it faster.
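The link between throughput and time to solution can be sketched with simple arithmetic. The example below assumes throughput stays constant for the whole run and ignores data loading stalls, evaluation, and checkpointing, so it gives a lower bound, not a prediction. The dataset size, epoch count, and throughput figures are made up for illustration.

```python
def estimate_training_hours(dataset_size, num_epochs, samples_per_sec):
    """Idealized training time: total samples seen divided by throughput."""
    total_samples = dataset_size * num_epochs
    return total_samples / samples_per_sec / 3600.0


# Hypothetical workload: 1,000,000 samples, 10 epochs.
slow_gpu_hours = estimate_training_hours(1_000_000, 10, samples_per_sec=500.0)
fast_gpu_hours = estimate_training_hours(1_000_000, 10, samples_per_sec=1000.0)

print(f"500 samples/sec:  {slow_gpu_hours:.1f} hours")
print(f"1000 samples/sec: {fast_gpu_hours:.1f} hours")
```

Under these assumptions, doubling throughput halves the estimated training time.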

To maximize training throughput, it’s important to saturate GPU resources with large batch sizes, switch to faster GPUs, or parallelize training across multiple GPUs. It’s also important to test throughput using state-of-the-art (SOTA) model implementations across frameworks, since throughput can be affected by the model implementation.
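One way to judge how well multi-GPU parallelization pays off is to compare measured throughput against ideal linear scaling from a single GPU. The sketch below is an illustration only: `scaling_efficiency` is a hypothetical helper, and the throughput numbers are invented, not taken from the benchmarks on this page.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Fraction of ideal linear scaling achieved by a multi-GPU run."""
    ideal_throughput = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal_throughput


# Invented throughput figures (samples/sec) keyed by GPU count.
measurements = {1: 400.0, 2: 760.0, 4: 1440.0, 8: 2720.0}

for num_gpus, throughput in measurements.items():
    efficiency = scaling_efficiency(measurements[1], throughput, num_gpus)
    print(f"{num_gpus} GPU(s): {throughput:.0f} samples/sec, "
          f"{efficiency:.0%} of linear scaling")
```

An efficiency well below 100% usually points to communication or input-pipeline overhead, not to the GPUs themselves.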

TensorFlow

We are working on new benchmarks using the same software version across all GPUs. Lambda's TensorFlow benchmark code is available here.

The RTX A6000 was benchmarked using NGC's TensorFlow 20.10 Docker image with Ubuntu 18.04, TensorFlow 1.15.4, CUDA 11.1.0, cuDNN 8.0.4, NVIDIA driver 455.32, and Google's official model implementations.

The A100s, RTX 3090, and RTX 3080 were benchmarked using Ubuntu 18.04, TensorFlow 1.15.4, CUDA 11.1.0, cuDNN 8.0.4, NVIDIA driver 455.45.01, and Google's official model implementations.

Pre-Ampere GPUs were benchmarked using TensorFlow 1.15.3, CUDA 10.0, cuDNN 7.6.5, NVIDIA driver 440.33, and Google's official model implementations.

PyTorch

We are working on new benchmarks using the same software version across all GPUs. Lambda's PyTorch benchmark code is available here.

The RTX A6000, A100s, RTX 3090, and RTX 3080 were benchmarked using NGC's PyTorch 20.10 Docker image with Ubuntu 18.04, PyTorch 1.7.0a0+7036e91, CUDA 11.1.0, cuDNN 8.0.4, NVIDIA driver 460.27.04, and NVIDIA's optimized model implementations.

Pre-Ampere GPUs were benchmarked using NGC's PyTorch 20.01 Docker image with Ubuntu 18.04, PyTorch 1.4.0a0+a5b4d78, CUDA 10.2.89, cuDNN 7.6.5, NVIDIA driver 440.33, and NVIDIA's optimized model implementations.

YOLOv5

YOLOv5 is a family of SOTA object detection architectures and models pretrained by Ultralytics. We use the open-source implementation in this repo to benchmark the inference latency of YOLOv5 models across various types of GPUs and model formats (PyTorch, TorchScript, ONNX, TensorRT, TensorFlow, TensorFlow GraphDef). Details for input resolutions and model accuracies can be found here.
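Inference latency is measured per call instead of per second, so a latency benchmark typically times each inference individually and summarizes the distribution. The sketch below shows that general pattern under stated assumptions; it is not the procedure used by the repo referenced above. `measure_latency_ms` and `fake_inference` are hypothetical names, and the sleep stands in for a real model call.

```python
import statistics
import time


def measure_latency_ms(infer_fn, num_runs=100, warmup_runs=10):
    """Return a list of per-call latencies in milliseconds."""
    # Untimed warm-up calls absorb one-off initialization costs.
    for _ in range(warmup_runs):
        infer_fn()

    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        infer_fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def fake_inference():
    # Hypothetical stand-in for running one image through a model.
    time.sleep(0.001)


latencies = measure_latency_ms(fake_inference, num_runs=30, warmup_runs=3)
print(f"mean {statistics.mean(latencies):.2f} ms, "
      f"median {statistics.median(latencies):.2f} ms")
```

Reporting the median alongside the mean helps show whether a few slow outlier calls are inflating the average.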