GPU use on NeSI
This page provides general information about how to access NeSI's GPU cards.
For application-specific settings (e.g. OpenMP, TensorFlow on GPU, ...), please have a look at the dedicated pages listed at the end of this page.
Warning
An overview of available GPU cards is available in the Available GPUs on NeSI support page. Details about pricing in terms of compute units can be found in the What is an allocation? page.
Note
Recall that the memory associated with the GPUs is VRAM, a separate resource from the RAM requested by Slurm. The memory values listed below are VRAM values.
Request GPU resources using Slurm¶
To request a GPU for your Slurm job, add the following option at the beginning of your submission script:
#SBATCH --gpus-per-node=1
You can specify the type and number of GPUs you need using the following syntax:
#SBATCH --gpus-per-node=<gpu_type>:<gpu_number>
It is recommended to specify the exact GPU type required; otherwise, the job may be allocated to any available GPU at the time of execution.
- A100 (PCIe, 40GB VRAM)

  #SBATCH --partition=genoa
  #SBATCH --gpus-per-node=A100:1

- A100 (HGX, 80GB VRAM)

  #SBATCH --partition=milan
  #SBATCH --gpus-per-node=A100:1

- H100 (96GB VRAM)

  #SBATCH --partition=genoa
  #SBATCH --gpus-per-node=H100:1

- L4 (24GB VRAM)

  #SBATCH --partition=genoa
  #SBATCH --gpus-per-node=L4:1
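The `<gpu_number>` field also lets you request more than one card of a given type. As a sketch, a hypothetical job header requesting two L4 GPUs might look like the following (whether multiple cards of a given type are actually available per node depends on the cluster configuration):

```shell
#!/bin/bash -e
#SBATCH --job-name=MultiGPU    # job name (shows up in the queue)
#SBATCH --time=00:10:00        # walltime (HH:MM:SS)
#SBATCH --partition=genoa      # partition hosting the L4 cards
#SBATCH --gpus-per-node=L4:2   # request two L4 GPUs on the node

# Slurm will set CUDA_VISIBLE_DEVICES to the allocated indices, e.g. "0,1"
nvidia-smi
```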
You can also use the --gpus-per-node option in Slurm interactive sessions, with the srun and salloc commands. For example:
srun --job-name "InteractiveGPU" --gpus-per-node L4:1 --partition genoa --cpus-per-task 8 --mem 2GB --time 00:30:00 --pty bash
will request and then start a bash session with access to an L4 GPU, for a duration of 30 minutes.
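The same resources can be requested with salloc, which creates the allocation first and then lets you run commands (including srun job steps) inside it. A minimal sketch, using the same options as the srun example above:

```shell
# Allocate an L4 GPU for 30 minutes; salloc starts a shell inside the allocation.
salloc --job-name "InteractiveGPU" --gpus-per-node L4:1 --partition genoa \
       --cpus-per-task 8 --mem 2GB --time 00:30:00

# Inside the allocation, commands launched with srun run on the allocated node:
srun nvidia-smi

# Release the allocation when done:
exit
```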
Warning
When you use the --gpus-per-node option, Slurm automatically sets the CUDA_VISIBLE_DEVICES environment variable inside your job environment to list the index(es) of the allocated GPU card(s) on each node.
srun --job-name "GPUTest" --gpus-per-node=L4:2 --time 00:05:00 --pty bash
srun: job 20015016 queued and waiting for resources
srun: job 20015016 has been allocated resources
$ echo $CUDA_VISIBLE_DEVICES
0,1
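Inside a job script, you can use this variable to check how many cards were actually allocated before launching a multi-GPU application. A minimal sketch, assuming Slurm populates the variable with comma-separated indices as shown above:

```shell
#!/bin/bash
# Count the GPUs visible to this job by parsing CUDA_VISIBLE_DEVICES,
# which holds a comma-separated list of indices, e.g. "0,1".
if [[ -z "${CUDA_VISIBLE_DEVICES:-}" ]]; then
    n_gpus=0
else
    n_gpus=$(echo "${CUDA_VISIBLE_DEVICES}" | tr ',' '\n' | wc -l)
fi
echo "${n_gpus} GPU(s) visible"
```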
Load CUDA and cuDNN modules¶
To use an NVIDIA GPU card with your application, you need to load the driver and the CUDA toolkit via the environment modules mechanism:
module load CUDA/11.0.2
You can list the available versions using:
module spider CUDA
Please Contact our Support Team if you need a version not available on the platform.
The CUDA module also provides access to additional command line tools:
- nvidia-smi to directly monitor GPU resource utilisation,
- nvcc to compile CUDA programs,
- cuda-gdb to debug CUDA applications.
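For example, nvidia-smi can report GPU utilisation in a machine-readable form, which is convenient for logging from inside a job script. A sketch using its query interface (run it on a node where a GPU has been allocated to you):

```shell
# One-off snapshot of each visible GPU, as CSV.
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
           --format=csv

# Refresh the same snapshot every 5 seconds (Ctrl-C to stop).
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
           --format=csv --loop=5
```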
In addition, the cuDNN (NVIDIA CUDA® Deep Neural Network library) library is accessible via its dedicated module:
module load cuDNN/8.0.2.39-CUDA-11.0.2
which will automatically load the related CUDA version. Available versions can be listed using:
module spider cuDNN
Example Slurm script¶
The following Slurm script illustrates a minimal example to request a GPU card, load the CUDA toolkit and query some information about the GPU:
#!/bin/bash -e
#SBATCH --job-name=GPUJob # job name (shows up in the queue)
#SBATCH --time=00-00:10:00 # Walltime (DD-HH:MM:SS)
#SBATCH --partition=genoa # this partition hosts the A100 (PCIe) cards with 40GB VRAM
#SBATCH --gpus-per-node=A100:1 # GPU resources required per node
#SBATCH --cpus-per-task=2 # number of CPUs per task (1 by default)
#SBATCH --mem=512MB # amount of memory per node (1GB by default)
# load CUDA module
module purge
module load CUDA/11.0.2
# display information about the available GPUs
nvidia-smi
# check the value of the CUDA_VISIBLE_DEVICES variable
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
Save this in a test_gpu.sl
file and submit it using:
sbatch test_gpu.sl
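sbatch prints the ID of the submitted job, which you can capture to monitor it. A small sketch (the --parsable option makes sbatch print just the job ID):

```shell
# Submit the script and capture the job ID.
jobid=$(sbatch --parsable test_gpu.sl)

# Check the job's state while it is queued or running.
squeue --job "${jobid}"

# Once the job has finished, inspect its output file.
cat "slurm-${jobid}.out"
```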
The content of the job output file should look like this:
cat slurm-20016124.out
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) slurm 2) NeSI
Wed May 12 12:08:27 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:05:00.0 Off | 0 |
| N/A 29C P0 23W / 250W | 0MiB / 12198MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
CUDA_VISIBLE_DEVICES=0
Note
CUDA_VISIBLE_DEVICES=0 indicates that this job was allocated CUDA GPU index 0 on this node. It is not a count of allocated GPUs.
Application- and toolbox-specific support pages¶
The following pages provide additional information for supported applications:
And programming toolkits: