Using GPUs

This page provides generic information about how to access GPUs through the Slurm scheduler.

Warning

Your first stop when looking into using GPUs should be the documentation of the application you are using.
Not every program can use a GPU, and how to use one effectively varies greatly!
A list of commonly used GPU-supporting software is provided at the bottom of this page.

Note

Recall that the memory associated with a GPU (VRAM) is a separate resource from the RAM requested via Slurm. The memory values listed below are VRAM values.

Request GPU resources using Slurm

To request a GPU for your Slurm job, add the following option in the header of your submission script:

#SBATCH --gpus-per-node=1

You can specify the type and number of GPUs you need using the following syntax:

#SBATCH --gpus-per-node=<gpu_type>:<gpu_number>

It is recommended to specify the exact GPU type required; otherwise, the job may be allocated to any available GPU at the time of execution.
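
For example, the following sketch requests two A100 GPUs per node (type names and per-node maximums are listed in the table below):

#SBATCH --gpus-per-node=a100:2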

Architecture    VRAM        Max Request    Slurm Header
Any             24GB-80GB   4              #SBATCH --gpus-per-node=1
NVIDIA A100     80GB        4              #SBATCH --partition=milan
                                           #SBATCH --gpus-per-node=a100:1
NVIDIA A100     40GB        2              #SBATCH --partition=genoa
                                           #SBATCH --gpus-per-node=a100:1
NVIDIA H100     96GB        2              #SBATCH --gpus-per-node=h100:1
NVIDIA L4       24GB        4              #SBATCH --gpus-per-node=l4:1

You can also use the --gpus-per-node option in Slurm interactive sessions with the srun and salloc commands. For example:

srun --job-name "InteractiveGPU" --gpus-per-node L4:1 --partition genoa --cpus-per-task 8 --mem 2GB --time 00:30:00 --pty bash

will request and then start a bash session with access to an L4 GPU, for a duration of 30 minutes.
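
The salloc command accepts the same resource options. The following is a minimal sketch, assuming the same resources as in the srun example above (on most Slurm configurations salloc starts a shell with the allocation attached, and srun then runs commands on the allocated node):

salloc --job-name "InteractiveGPU" --gpus-per-node L4:1 --partition genoa --cpus-per-task 8 --mem 2GB --time 00:30:00
srun nvidia-smi   # runs inside the allocation and reports the granted GPU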

Warning

When you use the --gpus-per-node option, Slurm automatically sets the CUDA_VISIBLE_DEVICES environment variable inside your job environment to list the index(es) of the allocated GPU card(s) on each node.

srun --job-name "GPUTest" --gpus-per-node=L4:2 --time 00:05:00 --pty bash
srun: job 20015016 queued and waiting for resources
srun: job 20015016 has been allocated resources
$ echo $CUDA_VISIBLE_DEVICES
0,1

Load CUDA and cuDNN modules

To use an NVIDIA GPU card with your application, you need to load the driver and the CUDA toolkit via the environment modules mechanism:

module load CUDA/11.0.2

You can list the available versions using:

module spider CUDA

Please contact our Support Team if you need a version that is not available on the platform.

The CUDA module also provides access to additional command line tools:

  • nvidia-smi to directly monitor GPU resource utilisation,
  • nvcc to compile CUDA programs,
  • cuda-gdb to debug CUDA applications.
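
For example, a minimal sketch of a typical workflow with these tools, assuming a hypothetical CUDA source file hello.cu and that the commands are run inside a GPU job or interactive session:

nvidia-smi                # report the GPUs visible to the job and their current utilisation
nvcc -o hello hello.cu    # compile the (hypothetical) CUDA source file into an executable
cuda-gdb ./hello          # run the resulting executable under the CUDA debugger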

In addition, the cuDNN library (NVIDIA CUDA® Deep Neural Network library) is accessible via its dedicated module:

module load cuDNN/8.0.2.39-CUDA-11.0.2

which will automatically load the related CUDA version. Available versions can be listed using:

module spider cuDNN
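
To check that the matching CUDA toolkit has indeed been pulled in, you can list the currently loaded modules after loading cuDNN (exact module names and versions vary between platforms):

module load cuDNN/8.0.2.39-CUDA-11.0.2
module list    # the related CUDA module should appear in this list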

Example Slurm script

The following Slurm script illustrates a minimal example to request a GPU card, load the CUDA toolkit and query some information about the GPU:

#!/bin/bash -e

#SBATCH --job-name=GPUJob        # job name (shows up in the queue)
#SBATCH --time=00-00:10:00       # Walltime (DD-HH:MM:SS)
#SBATCH --partition=genoa        # the job will land on an A100 with 40GB of VRAM
#SBATCH --gpus-per-node=A100:1   # GPU resources required per node
#SBATCH --cpus-per-task=2        # number of CPUs per task (1 by default)
#SBATCH --mem=512MB              # amount of memory per node

# load CUDA module
module purge
module load CUDA/11.0.2

# display information about the available GPUs
nvidia-smi

# check the value of the CUDA_VISIBLE_DEVICES variable
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

Save this in a test_gpu.sl file and submit it using:

sbatch test_gpu.sl
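
You can then monitor the job with the usual Slurm commands, for example:

squeue -u $USER      # list your queued and running jobs
sacct -j 20016124    # accounting details for the job (ID taken from the example output below)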

The content of the job output file would look similar to the following (exact GPU model, driver and CUDA versions will vary):

cat slurm-20016124.out
The following modules were not unloaded:
   (Use "module --force purge" to unload all):

  1) slurm   2) NeSI
Wed May 12 12:08:27 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   29C    P0    23W / 250W |      0MiB / 12198MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
CUDA_VISIBLE_DEVICES=0

Note

CUDA_VISIBLE_DEVICES=0 indicates that this job was allocated to CUDA GPU index 0 on this node. It is not a count of allocated GPUs.

Application- and toolbox-specific support pages

The following pages provide additional information for supported applications:

And programming toolkits: