OpenMP* Accelerator Offload

Portability Across Heterogeneous Architectures

The OpenMP standard has supported accelerator offload since version 4.0. Its offload directives let users move data and computation to devices such as GPUs, making it easier to write portable, heterogeneous parallel code. In this article, we discuss some of the OpenMP offload directives and show their usage with code samples. We also show some OpenACC* to OpenMP porting examples.

Porting OpenACC to OpenMP

OpenACC is a directive-based programming model used mainly for NVIDIA* GPUs, and the lack of support from other vendors effectively limits it to that one platform. OpenMP offload, on the other hand, has broader industry support: the oneAPI framework, the NVIDIA* HPC SDK, the AMD ROCm* stack, and the IBM* XL compiler suite. There is nearly a 1:1 mapping of OpenACC directives to OpenMP (Table 1), so porting legacy OpenACC code to OpenMP is usually straightforward. Table 1 shows some commonly used OpenACC pragmas and their OpenMP equivalents.

Table 1. Common OpenACC pragmas and their OpenMP equivalents
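A few commonly used pairings, shown here as an illustrative subset rather than a reproduction of the full table:

    #pragma acc parallel loop         ->  #pragma omp target parallel for (or target teams distribute parallel for)
    #pragma acc data copy(a)          ->  #pragma omp target data map(tofrom: a)
    #pragma acc data copyin(a)        ->  #pragma omp target data map(to: a)
    #pragma acc enter data copyin(a)  ->  #pragma omp target enter data map(to: a)
    #pragma acc update host(a)        ->  #pragma omp target update from(a)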

Figures 1a and 1b show a code snippet ported from OpenACC to OpenMP. The kernel comes from tConvolveACC, a radio astronomy package. The OpenACC directive, #pragma acc parallel loop, is replaced with the OpenMP offload directive, #pragma omp target parallel for, plus explicit directives that transfer data to and from the target device. The OpenACC implementation presumably handled these transfers through an implicit copy or a unified shared memory allocation.

Figure 1a. The sample kernel from tConvolveACC implemented in OpenACC.
Figure 1b. The sample kernel from tConvolveACC implemented in OpenMP.
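As a minimal sketch of this kind of port, consider a simplified, hypothetical accumulation loop (the function and variable names are placeholders, not the actual tConvolveACC code):

    // OpenACC version: data movement is left to the implementation
    // (implicit copies or unified shared memory).
    void scaleAdd_acc(const float* in, float* out, float alpha, int n)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            out[i] += alpha * in[i];
    }

    // OpenMP offload version: the map clauses make the host/device
    // transfers explicit.
    void scaleAdd_omp(const float* in, float* out, float alpha, int n)
    {
        #pragma omp target parallel for map(to: in[0:n]) map(tofrom: out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] += alpha * in[i];
    }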

 

OpenMP Offload on Intel® Platforms

We now look at the steps required to build and execute the offload code. We tested our OpenMP offload code with version 2021.2.0 of the Intel® oneAPI Base Toolkit, using the following compiler flags:
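A representative build invocation is sketched below; the icpx driver name, the -mllvm wrapper used to pass the internal option, and the source file name are assumptions rather than the original build line.

    icpx -fiopenmp -fopenmp-targets=spir64 \
         -mllvm -vpo-paropt-enable-64bit-opencl-atomics=true \
         tConvolveACC.cpp -o tConvolveACC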

The -fiopenmp and -fopenmp-targets=spir64 options tell the compiler to generate a fat binary containing both host code and SPIR-V device code for the GPU. The -vpo-paropt-enable-64bit-opencl-atomics=true option enables atomic and reduction operations on the device. See the online documentation for more details.

To run OpenMP offload code on the GPU, the user needs to set the OMP_TARGET_OFFLOAD environment variable (setting it to MANDATORY causes a runtime error if the GPU is unavailable). The user can also choose between the Level Zero and OpenCL™ backends:
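A minimal run setup might look like this (the variable names and values follow the Intel OpenMP offload runtime documentation for this toolkit version; treat the exact spellings as an assumption):

    export OMP_TARGET_OFFLOAD=MANDATORY   # fail rather than fall back to the host
    export LIBOMPTARGET_PLUGIN=LEVEL0     # or OPENCL to select the OpenCL backend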

The LIBOMPTARGET_DEBUG environment variable can be set to one or higher to obtain GPU offload debugging information. In Figure 2a, we highlight the debug information from the tConvolveACC OpenMP offload kernel when run with the Level Zero plugin. The two offload regions are in functions gridKernelACC and degridKernelACC, which belong to a class named Benchmark. Figure 2b shows the variable being transferred to the target device by the map clause. Figure 2c shows the data being transferred from the host to the target device. Once all the data required for the computation is present on the device, the kernel is executed, as shown at the bottom of Figure 2a.
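For example, to capture this output for a run of the application (the executable name is a placeholder):

    LIBOMPTARGET_DEBUG=1 ./tConvolveACC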

Figure 2a. Class and function information for the tConvolveACC OpenMP offload kernel, highlighted in red.
Figure 2b. Variable information for the tConvolveACC OpenMP offload kernel, highlighted in blue.
Figure 2c. Data transfer information for the tConvolveACC OpenMP offload kernel, highlighted in green.

 

Mapping OpenMP Threads to the Target Device

At run time, the OpenMP thread hierarchy is mapped to the target device. The #pragma omp teams construct creates a league of teams, and the initial thread of each team executes the region. The #pragma omp distribute construct distributes the work across the initial threads of the teams, with each team scheduled on a subslice (on Intel® GPUs). Work within each team is further parallelized with the parallel for construct, whose threads are mapped onto the execution unit (EU) threads. Finally, the #pragma omp simd construct uses the EU vector lanes to run vectorized code. Threads within a team synchronize at the end of a worksharing construct. Figure 3 illustrates this for 9th generation Intel® processor graphics, which has one slice, three subslices, eight EUs per subslice, seven threads per EU, and SIMD vector processing units in each EU, and shows how the OpenMP offload pragmas map onto these units.

Figure 3. Mapping OpenMP offload to hardware features on 9th generation Intel processor graphics (adapted from OpenMP Offloading Verification and Validation: Workflow and Road to 5.0).
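As a minimal sketch of how these constructs combine, consider a hypothetical blocked loop nest (not code from the article):

    // Outer loop: distributed across teams (one team per subslice on Intel GPUs).
    // Middle loop: shared among the threads of each team (EU threads).
    // Inner vectorization: carried out across the EU's SIMD lanes.
    void saxpy_blocks(float* y, const float* x, float a, int nblocks, int blocksize)
    {
        #pragma omp target teams distribute map(to: x[0:nblocks*blocksize]) \
                                            map(tofrom: y[0:nblocks*blocksize])
        for (int b = 0; b < nblocks; ++b) {
            #pragma omp parallel for simd
            for (int i = 0; i < blocksize; ++i) {
                int idx = b * blocksize + i;
                y[idx] += a * x[idx];
            }
        }
    }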

 

OpenMP Directives for Better Data Transfer to/from the Target Device

Having built an application and successfully offloaded some of its kernels to the target, the next step is to explore optimization opportunities such as reducing data transfer. OpenMP provides directives for efficient data transfer between host and target. The example below, tHogbomCleanACC, has two offload regions in the HogbomClean function. A naïve OpenMP offload results in data transfers at both target invocations, and the problem gets worse when this is repeated in a loop over g_niters iterations, as shown in the code snippet (Figure 4a).

Figure 4a. Naïve implementation of two OpenMP offload kernels resulting in unnecessary data transfers.
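The shape of the problem, on a simplified and hypothetical version of such a loop (residual, psf, and gain are stand-ins, not the actual HogbomClean code), is roughly:

    #include <math.h>

    // Naive sketch: both arrays are re-mapped, and therefore re-transferred,
    // on every target region in every iteration of the cleaning loop.
    void hogbomCleanNaive(float* residual, const float* psf,
                          int size, int g_niters, float gain)
    {
        for (int iter = 0; iter < g_niters; ++iter) {
            float peak = 0.0f;

            // Offload region 1: residual is copied to the device each iteration.
            #pragma omp target teams distribute parallel for \
                    map(to: residual[0:size]) map(tofrom: peak) reduction(max: peak)
            for (int i = 0; i < size; ++i) {
                float v = fabsf(residual[i]);
                if (v > peak) peak = v;
            }

            // Offload region 2: psf and residual are copied again, and residual
            // is copied back to the host, each iteration.
            #pragma omp target teams distribute parallel for \
                    map(to: psf[0:size]) map(tofrom: residual[0:size])
            for (int i = 0; i < size; ++i)
                residual[i] -= gain * peak * psf[i];
        }
    }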

 

Figure 4b shows an optimized implementation of the HogbomClean function with more efficient data transfer. The #pragma omp target data map directive defines a scope within which the data stays resident on the target. Any kernel offloaded within this scope can reuse that data. Subsequent map clauses on the offloaded kernels therefore trigger no data transfers, except for variables explicitly marked for transfer.

Figure 4b. OpenMP orphaning example with more efficient copy-once-and-reuse data transfer.
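Continuing the hypothetical sketch from above, the same loop wrapped in a target data region looks roughly like this:

    #include <math.h>

    // Optimized sketch: residual and psf are mapped once, outside the loop,
    // and stay resident on the device for all g_niters iterations.
    void hogbomCleanOptimized(float* residual, const float* psf,
                              int size, int g_niters, float gain)
    {
        #pragma omp target data map(tofrom: residual[0:size]) map(to: psf[0:size])
        {
            for (int iter = 0; iter < g_niters; ++iter) {
                float peak = 0.0f;

                // residual is already present on the device, so this map clause
                // only increments the reference count; no transfer occurs.
                #pragma omp target teams distribute parallel for \
                        map(to: residual[0:size]) map(tofrom: peak) reduction(max: peak)
                for (int i = 0; i < size; ++i) {
                    float v = fabsf(residual[i]);
                    if (v > peak) peak = v;
                }

                // psf and residual are reused on the device; residual is copied
                // back only when the enclosing target data region ends.
                #pragma omp target teams distribute parallel for \
                        map(to: psf[0:size]) map(tofrom: residual[0:size])
                for (int i = 0; i < size; ++i)
                    residual[i] -= gain * peak * psf[i];
            }
        }
    }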

 

Enhanced Support for Variant Function Dispatch

The OpenMP offload specification supports function variants that can be invoked conditionally in place of the base function. Intel implements this capability through the Intel-specific #pragma omp target variant dispatch directive, which tells the compiler to emit conditional dispatch code around the function call: if the target device is available, the function variant is invoked instead of the base function. Figures 5a, 5b, and 5c show an example of the target variant dispatch API. Note that the function variant must have the same arguments as the base function, plus an additional last argument of type void *.

Figure 5a. The function variant, findPeakOffload, executes on the target device.
Figure 5b. The base function, findPeak, executes on the host.
Figure 5c. The call site invokes only the base (host) function. The offload variant is executed if a target device is present; otherwise, the host version runs.
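A minimal sketch of this pattern is shown below. The function names findPeak and findPeakOffload come from the figures, but the signatures and bodies are illustrative guesses, not the actual tHogbomCleanACC code.

    // Offload variant: same parameter list as the base function plus the
    // trailing void* argument required by the Intel extension.
    void findPeakOffload(const float* image, int size, float* peak, void* /*interop*/)
    {
        float p = 0.0f;
        #pragma omp target teams distribute parallel for \
                map(to: image[0:size]) map(tofrom: p) reduction(max: p)
        for (int i = 0; i < size; ++i) {
            float v = image[i] < 0.0f ? -image[i] : image[i];
            if (v > p) p = v;
        }
        *peak = p;
    }

    // Base (host) function, annotated so the compiler knows which variant
    // to dispatch inside a target variant dispatch region.
    #pragma omp declare variant(findPeakOffload) \
            match(construct={target variant dispatch}, device={arch(gen)})
    void findPeak(const float* image, int size, float* peak)
    {
        float p = 0.0f;
        for (int i = 0; i < size; ++i) {
            float v = image[i] < 0.0f ? -image[i] : image[i];
            if (v > p) p = v;
        }
        *peak = p;
    }

    // Call site: only the base function is named. If a target device is
    // present, findPeakOffload runs; otherwise findPeak runs on the host.
    void cleanStep(const float* image, int size, float* peak)
    {
        #pragma omp target variant dispatch
        findPeak(image, size, peak);
    }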

 

Closing Remarks

The platform- and vendor-agnostic device offload support in the OpenMP standard makes it easier to target multiple heterogeneous architectures from the same code base. We therefore expect increasing adoption of OpenMP heterogeneous parallelism by users and by hardware and software vendors.

 

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.