Deep Learning GPU Benchmarks

GPU training/inference speeds using PyTorch®/TensorFlow for computer vision (CV), NLP, text-to-speech (TTS), etc.

PyTorch Training GPU Benchmarks 2023

Relative Training Throughput w.r.t 1x GPU2020 Cloud V100 16GB (all models):

GPU                             Speedup
H100 80GB SXM5                  7.67
H100 80GB PCIe Gen5             5.45
A100 80GB SXM4                  4.62
A100 80GB PCIe                  4.41
GPU2020 Cloud A100 40GB PCIe    3.57
RTX 4090                        2.94
RTX 6000 Ada                    2.86
RTX A6000                       2.15
RTX 3090                        1.80
GPU2020 Cloud A10               1.34
GPU2020 Cloud V100 16GB         1.00
Quadro RTX 8000                 0.98
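The speedup column is simply each GPU's measured throughput divided by the baseline GPU's throughput. A minimal sketch of that normalization (the function name and the sample numbers are illustrative, not taken from the benchmark code or results):

```python
def relative_throughput(throughputs, baseline):
    """Normalize measured training throughputs (samples/sec) to a
    chosen baseline GPU, producing speedups like the table above."""
    base = throughputs[baseline]
    return {gpu: round(tput / base, 2) for gpu, tput in throughputs.items()}

# Illustrative numbers only -- not actual measurements.
measured = {"V100 16GB": 100.0, "H100 80GB SXM5": 767.0, "RTX 4090": 294.0}
print(relative_throughput(measured, "V100 16GB"))
# -> {'V100 16GB': 1.0, 'H100 80GB SXM5': 7.67, 'RTX 4090': 2.94}
```

The baseline GPU always normalizes to 1.00, which is why the V100 rows in the tables read exactly 1.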

PyTorch Training GPU Benchmarks 2022

Relative Training Throughput w.r.t 1x V100 32GB (all models):

GPU                             Speedup
A100 80GB SXM4                  3.89
A100 80GB PCIe                  3.76
A100 40GB SXM4                  3.10
A100 40GB PCIe                  2.85
RTX A6000                       1.83
GPU2020 Cloud RTX A6000         1.80
RTX A5500                       1.53
RTX 3090                        1.49
RTX A40                         1.36
RTX A5000                       1.19
RTX A4500                       1.10
V100 32GB                       1.00
Quadro RTX 8000                 0.88
RTX 3080                        0.86
Titan RTX                       0.85
Quadro RTX 6000                 0.83
RTX A4000                       0.75
RTX 2080 Ti                     0.66
RTX 3080 Max-Q                  0.58
Quadro RTX 5000                 0.55
GTX 1080 Ti                     0.50
RTX 3070                        0.49
RTX 2080 SUPER Max-Q            0.37
RTX 2080 Max-Q                  0.34
RTX 2070 Max-Q                  0.33

YoloV5 Inference GPU Benchmarks

Relative Inference Latency w.r.t 1x RTX 8000 (all models):

GPU                             Relative Latency
RTX 8000                        1.00
A100 80GB PCIe                  0.73
RTX A6000                       0.70

GPU Benchmark Methodology

To measure the relative effectiveness of GPUs for training neural networks, we use training throughput as the measuring stick. Training throughput is the number of samples (e.g. tokens or images) the GPU processes per second.

Using throughput instead of peak Floating Point Operations per Second (FLOPS) grounds the comparison in the actual work of training neural networks. Training throughput is strongly correlated with time to solution: the higher the throughput, the faster the GPU pushes the dataset through the model, and the sooner training finishes.

To maximize training throughput, it is important to saturate GPU resources with large batch sizes, switch to faster GPUs, or parallelize training across multiple GPUs. It is also important to benchmark with state-of-the-art (SOTA) model implementations across frameworks, since throughput varies with the quality of the implementation.
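A minimal sketch of how such a throughput measurement might look, assuming a generic `train_step` callable rather than any particular framework (the function and parameter names are illustrative, not from the benchmark code):

```python
import time

def measure_throughput(train_step, batch_size, num_iters=100, warmup=10):
    """Time `num_iters` training steps and return samples processed
    per second. Warmup iterations are excluded so one-time costs
    (memory allocation, kernel/JIT compilation) don't skew the result.
    With a real GPU framework you would also synchronize the device
    (e.g. torch.cuda.synchronize()) before reading the clock, since
    GPU kernels launch asynchronously."""
    for _ in range(warmup):
        train_step()
    start = time.perf_counter()
    for _ in range(num_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return num_iters * batch_size / elapsed
```

Throughputs measured this way on different GPUs, with the same model implementation and software stack, can then be normalized to a baseline GPU to produce relative numbers like those in the tables above.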



We are working on new benchmarks using the same software version across all GPUs. GPU2020's PyTorch® benchmark code is available.

The 2023 benchmarks were run using NGC's PyTorch® 22.10 docker image with Ubuntu 20.04, PyTorch® 1.13.0a0+d0d6b1f, CUDA 11.8.0, cuDNN, NVIDIA driver 520.61.05, and NVIDIA's optimized model implementations.

The 2022 benchmarks were run using NGC's PyTorch® 21.07 docker image with Ubuntu 20.04, PyTorch® 1.10.0a0+ecc3718, CUDA 11.4.0, cuDNN, NVIDIA driver 470, and NVIDIA's optimized model implementations inside the NGC container.

PyTorch® is a registered trademark of The Linux Foundation.



YOLOv5 is a family of SOTA object detection architectures and models pretrained by Ultralytics. We use the open-source implementation in this repo to benchmark the inference latency of YOLOv5 models across various GPUs and model formats (PyTorch®, TorchScript, ONNX, TensorRT, TensorFlow, TensorFlow GraphDef). Details on input resolutions and model accuracies can be found here.
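Inference latency can be measured in the same spirit as training throughput. Here is a hedged, framework-agnostic sketch (not Ultralytics' actual harness) that times repeated forward passes after a warmup and reports the median latency, which is more robust to occasional stragglers than the mean:

```python
import time
import statistics

def median_latency_ms(infer, num_iters=100, warmup=10):
    """Return the median wall-clock latency of `infer()` in milliseconds.
    Warmup calls are excluded so model loading and kernel compilation
    don't inflate the measurement; with a real GPU backend you would
    also synchronize the device around each timed call."""
    for _ in range(warmup):
        infer()
    samples = []
    for _ in range(num_iters):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)
```

Dividing each GPU's median latency by the baseline GPU's gives relative latencies like the RTX 8000-normalized numbers above (lower is faster).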