Page Deprecated

Information you find on this page may be out of date and no longer accurate.

Offloading to GPU with OpenMP

Since OpenMP 4.5, it has been possible to offload computations from the CPU to a GPU; see https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Jost.pdf for an overview.

Example

The following example shows how to offload a reduction over a large loop:

#include <iostream>
#include <cmath>
int main() {
  int n = 1000000000;
  double total = 0;
  // Offload the loop to the GPU: copy n to the device, copy total to and
  // from the device, and combine per-thread partial sums with a reduction.
  #pragma omp target teams distribute parallel for map(tofrom: total) \
    map(to: n) reduction(+:total)
  for (int i = 0; i < n; ++i) {
    total += exp(sin(M_PI * (double) i/12345.6789));
  }
  std::cout << "total is " << total << '\n';
}

Save the above code in a file called total.cxx.

Note the pragma

#pragma omp target teams distribute parallel for map(tofrom: total) \
  map(to: n) reduction(+:total)

which maps n to the device and total both to and from the device, creates teams of threads on the GPU, distributes the loop iterations among them, and combines the per-thread partial sums via the reduction(+:total) clause.
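
The same construct can also be written in a longer form. The sketch below is not part of the original example; assuming an OpenMP 4.5 compiler, it splits the combined pragma into an explicit target data region plus an offloaded loop, and calls omp_get_num_devices() to report how many offload devices the runtime can see. The iteration count is reduced so it also finishes quickly if it falls back to the host:

#include <cmath>
#include <cstdio>
#include <omp.h>

int main() {
  int n = 1000000;   // fewer iterations than the benchmark above
  double total = 0;

  // 0 means the target region below will fall back to running on the host.
  std::printf("offload devices visible: %d\n", omp_get_num_devices());

  // Keep total resident on the device for the duration of the region...
  #pragma omp target data map(tofrom: total)
  {
    // ...and run the reduction loop there.
    #pragma omp target teams distribute parallel for map(to: n) reduction(+:total)
    for (int i = 0; i < n; ++i) {
      total += exp(sin(M_PI * (double) i / 12345.6789));
    }
  }
  std::printf("total is %g\n", total);
}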

Compile

We'll use the Cray C++ compiler to build the executable, but first we need to load a few modules:

module load cray-libsci_acc/18.06.1 craype-accel-nvidia60 \
 PrgEnv-cray/1.0.4 cuda92/blas/9.2.88 cuda92/toolkit/9.2.88

(You can ignore the warning "cudatoolkit >= 8.0 is required".)

To compare the execution times of the CPU and GPU versions, we build two executables:

CC -h noomp -o total total.cxx
CC -o totalOmpGpu total.cxx

Here total is compiled with -h noomp, i.e. with OpenMP turned off, so it runs entirely on the CPU; totalOmpGpu is built with OpenMP enabled (the Cray compiler's default) and will offload to the GPU.
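
On systems without the Cray compiler, other toolchains provide equivalent offload support. As a rough sketch only (the exact flags and device targets depend on your compiler and CUDA installation, so treat these as assumptions to verify):

g++ -O2 -fopenmp -foffload=nvptx-none -o totalOmpGpu total.cxx
clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o totalOmpGpu total.cxx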

Run

The following commands submit the runs to the Mahuika queue (note the --partition=gpu --gres=gpu:1 options for the executable that offloads to the GPU):

time srun --ntasks=1 --cpus-per-task=1 ./total
time srun --ntasks=1 --cpus-per-task=1 --partition=gpu --gres=gpu:1 ./totalOmpGpu

Typical execution times:

executable     time [s]
total          10.9
totalOmpGpu    0.45
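
For non-interactive runs, the GPU job could also be wrapped in a Slurm batch script. The sketch below only uses options already shown above; the job name and wall time are placeholders rather than values from this page:

#!/bin/bash
#SBATCH --job-name=totalOmpGpu   # placeholder job name
#SBATCH --time=00:05:00          # placeholder wall time
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

srun ./totalOmpGpu

Submit the script with sbatch followed by the script name.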