Comparing NVIDIA GPUs: H100, A100, L40, or L4?
In this post, we want to break down the differences between NVIDIA’s top-tier GPUs and identify workloads where each GPU model performs at its best.
The role of numerical precision
The recent success of NVIDIA’s H100, L4, and L40 GPUs owes a great deal to their support for FP8 precision, which is crucial for transformer models. To understand why FP8 matters, we need to talk about numerical representation: in machine learning, floating-point (FP) formats determine the precision of stored values and of the arithmetic performed on them, which directly affects the quality of model output.
FP64, or double-precision format, occupies 64 bits but is not typically used in ML; it is more relevant in scientific fields. The L40 and L4 GPUs, which lack hardware-level FP64 support, are unsuited for such scientific applications. Historically, FP32 served as the standard for deep learning until the shift to FP16, which reduced memory usage and accelerated computations without noticeable quality loss, making it the preferred format today.
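To make the memory difference concrete, here is a minimal sketch, assuming PyTorch and a toy MLP standing in for a real network, that compares the footprint of the same weights stored in FP32 and in FP16:

```python
import torch
import torch.nn as nn

def param_bytes(model: nn.Module) -> int:
    # Sum the storage size of every parameter tensor.
    return sum(p.numel() * p.element_size() for p in model.parameters())

# A toy MLP used only for illustration.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

print(f"FP32 weights: {param_bytes(model) / 2**20:.1f} MiB")
model = model.half()  # cast parameters to FP16
print(f"FP16 weights: {param_bytes(model) / 2**20:.1f} MiB")
```

The halved footprint is exactly why FP16 became the default: the same model takes half the RAM and moves half the bytes per operation.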
TF32 is another format: it differs only slightly from FP32 but enables much faster computation, because FP32 values are converted at the driver level so that Tensor cores can process them, with no code changes required.
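In PyTorch, for example, TF32 execution is toggled by a pair of backend flags; a minimal sketch, assuming an Ampere- or Hopper-class GPU is available:

```python
import torch

# Allow FP32 matmuls and convolutions to run on Tensor cores in TF32 mode;
# the model code and the stored FP32 values stay unchanged.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executed as TF32 on the Tensor cores
```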
INT8, an integer format, is utilized post-training through quantization to reduce memory and accelerate inference, although this is less effective for transformers due to their structure and post-training conversion challenges.
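As an illustration of post-training quantization, here is a sketch using PyTorch’s dynamic quantization, which stores the weights of the selected layer types in INT8; the toy model is an assumption made for the example:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Post-training dynamic quantization: weights of the listed layer types are
# stored in INT8, activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)
```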
FP8 emerges as a solution, particularly for transformer models: pre-trained models can be converted to FP8, reducing hardware requirements without losing performance. This is especially beneficial when transitioning from A100 to H100 GPUs. FP8 also supports mixed-precision training, which lowers RAM requirements and speeds up the process; and since the parameters are already stored in FP8, no conversion is needed for inference.
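On H100-class hardware, FP8 compute is typically reached through NVIDIA’s Transformer Engine library. The sketch below assumes the transformer_engine package and an FP8-capable GPU, and relies on the library’s default scaling recipe:

```python
import torch
import transformer_engine.pytorch as te

# A Transformer Engine layer whose matmuls can run in FP8 on H100-class GPUs.
layer = te.Linear(1024, 1024, bias=True).cuda()

x = torch.randn(32, 1024, device="cuda")

# fp8_autocast switches the enclosed Transformer Engine modules to FP8 compute;
# outside the context they run in the usual higher precision.
with te.fp8_autocast(enabled=True):
    y = layer(x)

print(y.shape)
```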
What types of GPU cores exist
Now it’s time to briefly discuss the differences between the types of GPU cores, which will be crucial to our conclusions.
- CUDA cores: These are general-purpose cores that can be adapted to a variety of tasks. While they can handle any computation, they aren’t as efficient as specialized cores.
- Tensor cores: Designed for matrix multiplication, these cores excel at this particular task. This specialization makes them the standard choice for deep learning.
- RT (ray tracing) cores: These cores are optimized for tasks that involve ray tracing, such as rendering images or video. When the goal is to produce photorealistic imagery, a GPU outfitted with RT cores is the way to go.
Almost every GPU model relevant to this article comes equipped with both CUDA and Tensor cores. However, only the L4 and L40 have RT cores.
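In frameworks such as PyTorch, the usual way to steer matrix multiplications onto the Tensor cores described above is mixed-precision autocast; a minimal sketch, assuming a CUDA device:

```python
import torch

a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")

# Under autocast, eligible ops (matmul among them) run in FP16,
# which is what maps them onto the GPU's Tensor cores.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b

print(c.dtype)  # torch.float16
```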
Key GPU specs and performance benchmarks for ML, HPC, and graphics
With the numerical formats out of the way, let’s turn our attention to the evolution of GPU specs and their standout features. We’ll focus only on the technical specifications relevant to this article. Pay particular attention to the first two rows: the amount of RAM and its bandwidth. An ML model must fit into the RAM of the GPUs accessible to the runtime environment; otherwise, you’ll need multiple GPUs for training. During inference, it’s often possible to fit everything on a single chip.

In the benchmarks, the H100 SXM result is used as the 100% baseline, and all other results are normalized relative to it. The chart reveals, for example, that 8-bit inference on the H100 is 37% faster than 16-bit inference on the same GPU model. This is due to hardware support for FP8-precision computing. When we say “hardware support,” we mean the entire low-level pipeline that moves data from the RAM to the Tensor cores for computation.
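A back-of-the-envelope check of the “does the model fit?” question is simple arithmetic; the parameter counts, RAM sizes, and overhead factor below are illustrative assumptions, not figures from the benchmarks:

```python
def fits_in_gpu(n_params: float, bytes_per_param: int, gpu_ram_gb: float,
                overhead: float = 1.2) -> bool:
    """Rough check: weights * per-parameter size * ~1.2x overhead vs. GPU RAM."""
    needed_gb = n_params * bytes_per_param * overhead / 1e9
    return needed_gb <= gpu_ram_gb

seven_b = 7e9  # a hypothetical 7B-parameter model
print(fits_in_gpu(seven_b, 2, 80))  # FP16 weights on an 80 GB card -> True
print(fits_in_gpu(seven_b, 1, 24))  # FP8/INT8 weights on a 24 GB card -> True
print(fits_in_gpu(seven_b, 4, 24))  # FP32 weights on a 24 GB card -> False
```

The same arithmetic also shows why lower-precision formats matter for hardware requirements: halving the bytes per parameter halves the RAM the model needs.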
Final overview
We can discern two primary use case categories: tasks that focus purely on computation (labeled “Compute”) and those that incorporate visualization (labeled “Graphic”). To reiterate, the A100 and H100 are entirely unsuitable for graphics, while the L4 and L40 are tailor-made for it.
At first glance, you might assume that the A100 or L40 would be equally good at inference. However, as we unpacked earlier in our section on the transformer architecture, there are nuances to consider.
The column labeled “HPC” indicates whether multiple hosts can be merged into a single cluster. In inference, clustering is rarely necessary — but that depends on the model’s size.
The key point is that the model must fit into the combined memory of all the GPUs on a host. If it exceeds that limit, or if the host can’t accommodate enough GPUs to provide the required RAM, a GPU cluster is needed. With the L40 and L4, scalability is limited by the capabilities of an individual host; the H100 and A100 don’t have that limitation.
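For scale-out beyond a single GPU, a typical PyTorch setup looks like the sketch below; it assumes a torchrun launch (which sets the rank environment variables) and uses FSDP so that parameters are sharded across all GPUs rather than replicated on each one:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Minimal multi-GPU (and multi-host) setup; assumes launch via torchrun.
dist.init_process_group(backend="nccl")  # NCCL rides on NVLink / InfiniBand
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# A toy model standing in for a real network.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# FSDP shards parameters across all ranks, so the combined RAM of the
# host (or cluster), not a single GPU, bounds the model size.
model = FSDP(model)

x = torch.randn(8, 4096, device="cuda")
model(x).sum().backward()
dist.destroy_process_group()
```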
So, which NVIDIA GPU should you use in your ML workloads? Let’s look at your options:
- L4: An affordable, general-purpose GPU suitable for a wide variety of use cases. It’s an entry-level model serving as the gateway to the world of GPU-accelerated computing.
- L40: Optimized for generative AI inference and visual computing workloads.
- A100: Offers superior price-to-performance value for single-node training of conventional CNNs.
- H100: The best choice for BigNLP, LLMs, and transformers. It’s also well-equipped for distributed training scenarios, as well as for inference.