Programming Data Parallel C++

A Step-by-Step Tutorial for Developing Data Parallel C++ Kernels

[Editor’s note: This tutorial was adapted from Jeff’s GitHub repository.]

This is an introduction to the Data Parallel C++ programming model, or DPC++ for short. DPC++ is based on Khronos SYCL, which means it’s a modern C++ parallel programming model. SYCL 1.2.1 is the latest Khronos standard, although the SYCL 2020 provisional specification is available for review. Intel and other members of the SYCL working group are finalizing the next version of the language specification. DPC++ contains extensions that make SYCL easier to use, although many of these are expected to be part of SYCL 2020. Implementing extensions in the DPC++ compiler helps the community evaluate their effectiveness before standardization.

Who Is This Tutorial For?

This tutorial is for programmers who already have a decent understanding of C++ and parallelism. Teaching C++ and parallelism is hard—and there’s a lot of material out there already. There’s far less information on SYCL itself, and even less about DPC++, so that’s our focus.

SYCL is derived from OpenCL, and the execution models are quite similar. If you need help understanding the SYCL/OpenCL execution model, please check out this overview.

Who Is This Tutorial Not For?

When I tell people about SYCL, I often say, “If you like modern C++, you’ll like SYCL because it’s definitely modern C++.” A corollary to this is that if you hate C++, you’ll hate SYCL and DPC++. So, if you don’t want to write modern C++, this tutorial is not for you.

OpenMP 5.0 offers many of the same features as SYCL/DPC++ but supports the ISO language triumvirate of C++, C, and Fortran. If you want to program CPUs and GPUs using Fortran, C, or pre-modern C++ (i.e., before C++11) using an open industry standard, try OpenMP.

Another alternative to SYCL/DPC++ without the C++ is OpenCL. OpenCL is a lot more verbose than SYCL, but if you’re a C programmer, you likely prefer explicit control to syntactic efficiency.

The Tutorial

We’ll start with vector addition, which is the “Hello, world!” of HPC and numerical computation. Printing “Hello, world!” doesn’t make a lot of sense in a programming model used for doing lots of things in parallel.

Vector Addition in SYCL

The operation we’re trying to implement is SAXPY, which stands for Single-precision A times X plus Y. It can be implemented in C or C++ as follows:
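(A sketch, assuming float arrays X and Y of length n and a scalar A.)

    // SAXPY: Y = A * X + Y, one element per iteration
    for (size_t i = 0; i < n; ++i) {
        Y[i] = A * X[i] + Y[i];
    }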

There are lots of ways to write this in C++. For example, we could use ranges, which would make the code look a bit more like the upcoming SYCL version. But teaching you every possible way to write a loop in C++ isn’t the point of this tutorial, and everybody understands the version that looks like C.

Here’s the same loop in SYCL. There’s a lot to unpack here, so we’ll break it down piece by piece:
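(A sketch of the kernel launch; the handler h and the accessors X and Y are set up in later sections.)

    h.parallel_for<class saxpy>(sycl::range<1>{n}, [=](sycl::id<1> i) {
        Y[i] = A * X[i] + Y[i];
    });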

As you might have guessed, parallel_for is a parallel for-loop. The loop body is expressed as a lambda. The lambda is the code that looks like [..]{..}.

The loop iterator is expressed in terms of a sycl::range and a sycl::id. In our simple example, both are one-dimensional, as indicated by the <1>. SYCL ranges and ids can be one-, two-, or three-dimensional. (OpenCL and CUDA have the same limitation.)

It may be a bit unfamiliar to write loops like this, but it’s consistent with how lambdas work. And if you’ve ever used Parallel STL, TBB, Kokkos, or RAJA, you’ll recognize the pattern.

You might be wondering about the <class saxpy> template argument to parallel_for. This is just a way to name the kernel, which is necessary because you might want to use SYCL with a different host C++ compiler than the SYCL device compiler. In that case, the two compilers need a way to agree on the kernel name. In many SYCL compilers, such as Intel DPC++, this isn’t necessary, and we can tell the compiler not to look for kernel names by using the option -fsycl-unnamed-lambda.

We won’t try to explain what the h in h.parallel_for is right now. We’ll cover that later.

SYCL Queues

One challenge of heterogeneous programming is the multiple types of processing elements and, often, different memory types. These things make compilers and runtimes more complicated. The SYCL programming model embraces heterogeneous execution, although at a much higher level than OpenCL. Not everything is explicit, either. Unlike other popular GPU programming models, SYCL kernels can be inlined into the host program flow, which improves readability.

Whenever we want to compute on a device, we need to create a work queue:
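For example, using the SYCL 1.2.1 selector classes:

    // Let the runtime pick a device (GPU if available, otherwise CPU).
    sycl::queue q(sycl::default_selector{});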

The default selector favors a GPU, if present, and a CPU otherwise. We can create queues associated with specific device types using this:
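For example, each of these constructs a queue bound to a different kind of device:

    sycl::queue q_host(sycl::host_selector{});  // host device (useful for debugging)
    sycl::queue q_cpu(sycl::cpu_selector{});    // CPU, e.g., via the OpenCL runtime
    sycl::queue q_gpu(sycl::gpu_selector{});    // GPU, if one is present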

The host and CPU selectors may lead to significantly different results even though they target the same hardware. The host selector might use a sequential implementation optimized for debugging, while the CPU selector uses the OpenCL runtime and runs across all the cores. Also, the OpenCL just-in-time (JIT) compiler might generate different code because it’s a different compiler altogether. Don’t assume that host and CPU mean the same thing in SYCL just because the host is a CPU.

Managing Data in SYCL Using Buffers

The canonical way to manage data in SYCL is with buffers. A SYCL buffer is an opaque container. This is an elegant design, but some applications would like pointers, which are provided by the USM extension, discussed later.
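For example, a host std::vector can be handed to a buffer like this (a sketch; the names h_X and d_X are illustrative):

    std::vector<float> h_X(n, 1.0f);
    // The host must not touch h_X through non-SYCL means until d_X is destroyed.
    sycl::buffer<float, 1> d_X(h_X.data(), sycl::range<1>(h_X.size()));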

In the previous example, the user allocates a C++ container on the host and then hands it over to SYCL. Until the destructor of the SYCL buffer is invoked, the user can’t access the data through a non-SYCL mechanism. SYCL accessors are the most important aspect of SYCL data management with buffers; we’ll explain them below.

Controlling Device Execution

Because device code may require a different compiler or code generation mechanism from the host, it’s necessary to clearly identify sections of device code. Below we see how this looks in SYCL 1.2.1. We use the submit method to enqueue work to the device queue, q. This method returns an opaque handler against which we execute kernels, in this case via parallel_for.
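Schematically, it looks like this (a sketch; the accessors X and Y used inside the kernel are created from the buffers within the submit lambda, as shown in the next section):

    q.submit([&](sycl::handler& h) {
        // Accessors to the buffers are created here (see below), then the
        // kernel is launched against the handler:
        h.parallel_for<class saxpy>(sycl::range<1>{n}, [=](sycl::id<1> i) {
            Y[i] = A * X[i] + Y[i];
        });
    });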

We can synchronize device execution using the wait() method. There are finer-grained methods for synchronizing device execution, but we’ll start with the simplest one, which is a heavy hammer.
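For example:

    q.wait();  // block the host until all work submitted to q has finished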

Some users may find the above code a bit verbose, particularly compared to models like Kokkos. The Intel DPC++ compiler supports a terse syntax, which we’ll cover below.

Compute Kernels and Buffers

SYCL accessors are the final piece in our first SYCL program. Accessors may be unfamiliar to GPU programmers, but they have a number of nice properties compared to other methods. While SYCL allows the programmer to move data explicitly using, for example, the copy() method, the accessor methods don’t require this because they generate a dataflow graph that the compiler and runtime can use to move data at the right time. This is particularly effective when multiple kernels are invoked in sequence. In this case, the SYCL implementation will deduce that data is reused and not copy it back to the host unnecessarily. Also, we can schedule data movement asynchronously (i.e., overlapped with device execution). While expert GPU programmers can do this manually, we often find that SYCL accessors lead to better performance than OpenCL programs where programmers must move data explicitly.
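Here’s a sketch of accessors in use, assuming the buffers d_X and d_Y introduced earlier:

    q.submit([&](sycl::handler& h) {
        // Declare how each buffer is used; the runtime builds the dataflow
        // graph and moves data only when necessary.
        auto X = d_X.get_access<sycl::access::mode::read>(h);
        auto Y = d_Y.get_access<sycl::access::mode::read_write>(h);
        h.parallel_for<class saxpy>(sycl::range<1>{n}, [=](sycl::id<1> i) {
            Y[i] = A * X[i] + Y[i];
        });
    });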

Because programming models that assume pointers are handles to memory have a hard time with SYCL accessors, the USM extension makes accessors unnecessary. USM places a greater burden on the programmer in terms of data movement and synchronization but helps with compatibility in legacy code that wants to use pointers.

Review of Our First SYCL Program

Here are all the components of our SYCL SAXPY program we just described:
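Assembled into a single listing, the buffer-based version might look roughly like this (a sketch assuming a DPC++-style compiler where <CL/sycl.hpp> provides the sycl:: namespace; see the repository below for the author’s complete program):

    #include <vector>
    #include <CL/sycl.hpp>

    int main() {
        const size_t n = 1024;
        const float A = 2.0f;
        std::vector<float> h_X(n, 1.0f), h_Y(n, 2.0f);

        sycl::queue q(sycl::default_selector{});
        {
            // Buffers manage the data until they are destroyed.
            sycl::buffer<float, 1> d_X(h_X.data(), sycl::range<1>(n));
            sycl::buffer<float, 1> d_Y(h_Y.data(), sycl::range<1>(n));

            q.submit([&](sycl::handler& h) {
                auto X = d_X.get_access<sycl::access::mode::read>(h);
                auto Y = d_Y.get_access<sycl::access::mode::read_write>(h);
                h.parallel_for<class saxpy>(sycl::range<1>{n}, [=](sycl::id<1> i) {
                    Y[i] = A * X[i] + Y[i];
                });
            });
            q.wait();
        }  // buffer destructors write the results back to the host vectors

        return 0;
    }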

The full source code for this example is available in the GitHub repository at https://github.com/jeffhammond/dpcpp-tutorial.

SYCL 2020 Unified Shared Memory (USM)

While the program above is perfectly functional and can be implemented across a wide range of platforms, some users will find it rather verbose. Furthermore, it’s not compatible with libraries and frameworks that need to manage memory using pointers. To address this issue with SYCL 1.2.1, Intel developed an extension in DPC++ called Unified Shared Memory (USM) that supports pointer-based memory management.

USM supports two important usage models, both of which will be illustrated below. The first one supports automatic data movement between the host and device. The second one is for explicit data movement to and from device allocations.

The details are in the SYCL 2020 provisional specification, but to get started, all you need to know is below. The q argument is the queue associated with the device where the allocated data will live (either permanently or temporarily):
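A sketch of the two allocation styles (sycl::free releases either kind):

    // Shared allocation: migrates automatically between host and device.
    float* X = sycl::malloc_shared<float>(n, q);

    // Device allocation: lives on the device; data must be copied explicitly.
    float* Y = sycl::malloc_device<float>(n, q);

    // ... use the allocations ...

    sycl::free(X, q);
    sycl::free(Y, q);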

If we’re using device allocation, data must be moved explicitly (e.g., using the SYCL memcpy method), which behaves the same way std::memcpy does (i.e., the destination is the first argument):
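(Sketch; d_Y is a device allocation and h_Y a host-side std::vector; both names are illustrative.)

    // Host to device...
    q.memcpy(d_Y, h_Y.data(), n * sizeof(float)).wait();
    // ...and device back to host after the kernel has run.
    q.memcpy(h_Y.data(), d_Y, n * sizeof(float)).wait();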

If we use USM, accessors are no longer required, which means we can simplify the kernel code above to:
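(Sketch; X and Y are now plain USM pointers captured by value, so no accessors are needed.)

    q.submit([&](sycl::handler& h) {
        h.parallel_for<class saxpy>(sycl::range<1>{n}, [=](sycl::id<1> i) {
            Y[i] = A * X[i] + Y[i];
        });
    });
    q.wait();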

You can find the complete working examples of both versions of USM in this repo, named saxpy-usm.cc and saxpy-usm2.cc, respectively.

SYCL 2020 Terse Syntax

Finally, in case you’ve been wondering why the opaque handler h was required in each of these programs, it turns out that it isn’t required after all. The following is an equivalent implementation, which was added in the SYCL 2020 provisional specification. Furthermore, we can take advantage of lambda names being optional in the SYCL 2020 provisional specification. Together, these two small changes make SYCL kernels the same length as the original C++ loop listed at the beginning of this tutorial:
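(Sketch; the SYCL 2020 queue shortcut launches the kernel directly and returns an event we can wait on.)

    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        Y[i] = A * X[i] + Y[i];
    }).wait();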

We started with three lines of code that run sequentially on a CPU and end with three lines of code that run in parallel on CPUs, GPUs, FPGAs, and other devices. Obviously, not everything will be as simple as SAXPY, but at least now you know that SYCL isn’t going to make easy things hard, and it builds on a number of modern C++ features and universal concepts like “parallel for” rather than introducing new things to learn.

More Resources

For more complete information about compiler optimizations, see our Optimization Notice.