Members of the University of Sheffield have access to a range of GPU resources for carrying out their research, available in local (Tier 3) and affiliated regional (Tier 2) HPC systems.
As of March 2024, the N8 CIR Bede Tier 2 HPC facility now includes an Open Pilot of 3 NVIDIA GH200 Nodes which are available to all users.
Each GH200 node in Bede contains a single NVIDIA GH200 Grace Hopper Superchip - a 72 core NVIDIA Grace ARM CPU connected to a single NVIDIA Hopper GPU via a 900GB/s NVIDIA NVLink-C2C interconnect. This new interconnect allows data to be moved between the host and device with a much higher bandwidth than in traditional PCI-e based systems, reducing the time spent transferring data.
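As a rough illustration of why this matters, consider the time to stage a fixed amount of data over each link. The figures below are theoretical peaks used purely as an assumed comparison point (~32 GB/s for PCIe 4.0 x16; real workloads will achieve less on both links):

```python
def transfer_time_s(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time in seconds to move size_gb of data over a link of the given bandwidth."""
    return size_gb / bandwidth_gb_s

# Staging a hypothetical 40 GB working set onto the GPU:
pcie4 = transfer_time_s(40, 32.0)   # PCIe 4.0 x16, ~32 GB/s theoretical peak
c2c = transfer_time_s(40, 900.0)    # NVLink-C2C, 900 GB/s
print(f"PCIe 4.0: {pcie4:.2f}s, NVLink-C2C: {c2c:.3f}s")
# → PCIe 4.0: 1.25s, NVLink-C2C: 0.044s
```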
The following figure shows the theoretical peak bandwidth of the interconnect technologies used across a range of GPUs.
Source: github.com/ptheywood/gpu-interconnect-plots
To illustrate the performance of the GH200 GPUs for machine learning workloads, a benchmark of LLM fine-tuning, previously used by Farhad Allian (a Research Data Engineer in Research & Innovation IT) to investigate the performance of NVIDIA L40 GPUs for machine learning, was run on the GH200 GPUs in Bede.
The benchmark uses the HuggingFace Transformers run_clm.py example to train and evaluate the fine-tuning of the GPT-2 124 million parameter LLM using the WikiText-2 (raw) dataset in FP32 and FP16 precisions. Each benchmark was repeated 3 times, using a single batch size of 8. This batch size allows the benchmark to be repeated on GPUs with lower memory capacity, but larger batch sizes would likely improve performance for GPUs with sufficient memory, such as the GH200.
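An invocation along the following lines reproduces this kind of run. The flags shown are from the standard HuggingFace run_clm.py example; the exact arguments and job scripts used for the benchmark are those in the linked repository:

```shell
# Illustrative run_clm.py invocation; FP16 runs additionally pass --fp16
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir ./gpt2-wikitext2
```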
As of April 2024, pre-built binary wheels and conda packages for PyTorch for aarch64 systems such as the GH200 do not include CUDA support.
Instead, the benchmark was containerised via Apptainer, using containers based on the NGC PyTorch containers.
Version 24.02 was used for this benchmark, resulting in a software environment containing:

- Python 3.10
- CUDA 12.3.2
- PyTorch 2.3.0a0+ebedce2
- Transformers 4.37.0
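The container workflow might look roughly as follows (image filename and the trailing command are illustrative; the repository's scripts are authoritative):

```shell
# Fetch the NGC PyTorch 24.02 container as an Apptainer image
apptainer pull pytorch-24.02.sif docker://nvcr.io/nvidia/pytorch:24.02-py3

# Run inside the container; --nv exposes the host NVIDIA driver
# and GPU devices to the containerised environment
apptainer exec --nv pytorch-24.02.sif python run_clm.py --help
```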
The benchmark was then executed on V100 GPUs in Bessemer, A100 & H100 PCIe GPUs in Stanage, and the GH200 GPUs in Bede. Source files, instructions, job submission scripts and the generated results and figures can be found in the RSE-Sheffield/pytorch-transformers-wikitext2-benchmark GitHub repository.
The following figures and tables show the benchmark data for FP32 training and inferencing across a range of GPUs.
This includes the runtime training and inferencing phases in seconds (lower is better), and the samples processing rate in samples per second (higher is better).
As you might expect, newer generations of GPU offer reduced application runtimes and increased performance compared to previous generations, with the GH200 outperforming the V100 SXM2 GPUs in Bessemer, the A100 SXM4 GPUs in Stanage and the H100 PCIe GPUs in Stanage.
Metric | V100 SXM2 | A100 SXM4 | H100 PCIe | GH200 480GB |
---|---|---|---|---|
FP32 Training Time (s) | 733.447 | 204.360 | 181.747 | 114.210 |
FP32 Inference Time (s) | 9.827 | 3.287 | 2.973 | 1.997 |
FP32 Training Samples per Second | 9.481 | 34.028 | 38.261 | 60.886 |
FP32 Inference Samples per Second | 24.413 | 72.932 | 80.666 | 119.908 |
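The relative speedups implied by these results can be computed directly from the table, for example (a quick sketch using the FP32 training times):

```python
# FP32 training times in seconds, taken from the table above
fp32_training_s = {
    "V100 SXM2": 733.447,
    "A100 SXM4": 204.360,
    "H100 PCIe": 181.747,
    "GH200 480GB": 114.210,
}

def speedup(baseline: str, other: str) -> float:
    """Runtime ratio: how many times faster `other` is than `baseline`."""
    return fp32_training_s[baseline] / fp32_training_s[other]

print(f"GH200 vs V100 SXM2: {speedup('V100 SXM2', 'GH200 480GB'):.2f}x")  # → 6.42x
print(f"GH200 vs H100 PCIe: {speedup('H100 PCIe', 'GH200 480GB'):.2f}x")  # → 1.59x
```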
The following figures and tables show the benchmark data for FP16 training and inferencing across a range of GPUs.
This includes the runtime training and inferencing phases in seconds (lower is better), and the samples processing rate in samples per second (higher is better).
As with the FP32 results, the newer generations of GPU offer improved performance over older GPUs, with the GH200 out-performing the other models. The relative performance difference will vary from workload to workload, with larger batch sizes likely showing increased performance.
Metric | V100 SXM2 | A100 SXM4 | H100 PCIe | GH200 480GB |
---|---|---|---|---|
FP16 Training Time (s) | 376.310 | 198.677 | 172.673 | 116.243 |
FP16 Inference Time (s) | 5.723 | 3.290 | 2.893 | 2.147 |
FP16 Training Samples per Second | 18.479 | 35.001 | 40.271 | 59.833 |
FP16 Inference Samples per Second | 41.892 | 72.819 | 82.826 | 111.463 |
As the University of Sheffield is a member organisation of the N8 CIR, Bede is available for use by researchers at the University.
Access is granted on a per project basis, with the N8 CIR Bede website providing instructions on how to apply for access via the online form. Once submitted, the application will be reviewed, and if deemed appropriate and compatible with Bede, the project will be created.
Bede’s online documentation now includes GH200 specific information on the appropriate pages, in addition to the high level overview of the GH200 pilot. However, as there are only a limited number of GH200 GPUs in Bede at this time, jobs may spend a significant amount of time in the queue.
In addition to Bede, Sheffield researchers can also access a range of GPUs in our local Tier 3 facilities, Bessemer and Stanage, as well as the Tier 2 JADE HPC Facility.
For queries relating to collaborating with the RSE team on projects: rse@sheffield.ac.uk
Information and access to JADE II and Bede.
To be notified when we advertise talks and workshops, join our mailing list by subscribing to this Google Group.
Queries regarding free research computing support/guidance should be raised via our Code clinic or directed to the University IT helpdesk.