GPU-Quicksort*

How to Move from OpenCL™ to Data Parallel C++

Data Parallel C++ (DPC++) is a heterogeneous, portable programming language based on the Khronos SYCL* standard. This single-source programming language can target an array of platforms: CPUs, integrated and discrete GPUs, FPGAs, and other accelerators. To give you an idea of what DPC++ can do, we'll port a non-trivial OpenCL™ application, GPU-Quicksort*, to DPC++ and document the experience. Our goal is to exceed the capabilities of the initial application. OpenCL™ C makes it hard to write generic algorithms, a shortcoming that becomes obvious when you implement algorithms, like sorting, that must work for different data types. The original GPU-Quicksort for OpenCL™ was written to sort unsigned integers. We'll demonstrate how to use templates with DPC++ and implement GPU-Quicksort for multiple data types. Finally, we'll port GPU-Quicksort to Windows* and Ubuntu* 18.04 to show DPC++ portability.

What’s GPU-Quicksort?

GPU-Quicksort is a high-performance sorting algorithm designed specifically for highly parallel, multicore graphics processors. It was invented in 2009 by Daniel Cederman and Philippas Tsigas, a student and professor at the Chalmers University of Technology in Sweden. Originally implemented in CUDA*, it was reimplemented by me in 2014 in OpenCL™ 1.2 and OpenCL™ 2.0 to demonstrate high performance on Intel® Integrated Processor Graphics and to showcase nested parallelism and work-group scan functions, which OpenCL™ 2.0 introduced and Intel OpenCL™ drivers fully implement. We'll port the OpenCL™ 1.2 implementation of GPU-Quicksort to DPC++ and make the implementation generic so that it can sort not just unsigned integers, but also floats and doubles.

What’s OpenCL™?

We’ll start with the OpenCL™ 1.2 implementation. Intel fully supports OpenCL™, a Khronos standard for programming heterogeneous parallel systems, on a variety of operating systems and platforms. OpenCL™ consists of:

  • The runtime
  • The host API
  • The device C-based programming language OpenCL™ C

Here lie both its power and its limitations. The power is the ability to write high-performance, portable, heterogeneous parallel applications. The main limitation is the necessity to write and debug two separate code bases, the host side and the device side, as well as the lack of templates and other C++ features modern programmers are accustomed to, which makes writing generic libraries in OpenCL™ hard.

What’s Data Parallel C++?

DPC++ is an Intel implementation of Khronos SYCL* with extensions. The SYCL standard is designed to address the OpenCL™ limitations outlined above. DPC++ provides:

  • A single-source programming model, which consists of a single code base for both host and device programming
  • The full use of C++ templates and template metaprogramming on the device with minimal impact on performance without compromising portability

DPC++ lets a programmer target CPUs, GPUs, and FPGAs while permitting accelerator-specific tuning—a definite improvement over OpenCL™. It’s also supported by Intel® software tools like Intel® VTune™ Profiler and Intel® Advisor, as well as by GDB*. We’ll make full use of DPC++, especially its template features.

The Starting Point: Windows* Apps from 2014

We’ll start with GPU-Quicksort for OpenCL™ 1.2 (as described in the article GPU-Quicksort in OpenCL™ 2.0: Nested Parallelism and Work-Group Scan Functions). The original application was written for Windows, so we first port it to Ubuntu 18.04 by adding cross-platform code to measure time and by using aligned_alloc/free for aligned memory allocation/deallocation, as opposed to _aligned_malloc/_aligned_free on Windows.

Let’s get a brief overview of GPU-Quicksort architecture. It consists of two kernels:

  1. gqsort_kernel
  2. lqsort_kernel

Written in OpenCL™ 1.2, these are glued together by dispatcher code, which iteratively calls gqsort_kernel until the input is split into chunks small enough for lqsort_kernel to sort completely. The application allows the user to select:

  • The number of times to run sort for measurement purposes
  • The vendor and device on which to run the kernels
  • The size of the input
  • Whether to show device details

The application follows a typical OpenCL™ architecture: supporting utilities for initializing OpenCL™ platforms and devices and building code for them; a separate file with the OpenCL™ kernels and their supporting functions; and the main application, which accepts user arguments, initializes the platform and device, builds the kernels, allocates memory and creates buffers from it, binds them to the kernel arguments, and launches the dispatcher function.

Data Parallel C++/OpenCL™ Interoperability: Platform Initialization

First, install the Intel® oneAPI Base Toolkit, which includes the Intel® oneAPI DPC++ Compiler. We start our port to DPC++ by including the CL/sycl.hpp header and, to spare us the verbosity of fully qualified names, adding a using namespace cl::sycl clause:
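A minimal sketch of the preamble, assuming the CL/sycl.hpp header shipped with the oneAPI toolchains of that era:

```cpp
#include <CL/sycl.hpp>

// Pulls queue, buffer, accessor, device, etc. into scope for the rest
// of the file.
using namespace cl::sycl;
```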

Now, instead of initializing a platform, a device, a context, and a queue the OpenCL™ way, we do it the concise DPC++ way:
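In SYCL, constructing a queue is enough; the device and context come with it. A sketch:

```cpp
// One object replaces the clGetPlatformIDs / clGetDeviceIDs /
// clCreateContext / clCreateCommandQueueWithProperties sequence.
queue q(default_selector{});
```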

We also need to retrieve the underlying OpenCL™ context, device, and queue, since the rest of the application is OpenCL™ based:
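With the SYCL 1.2.1 interoperability API, each wrapper object hands back its underlying OpenCL™ handle via get(). A sketch, with q being the queue created above:

```cpp
// The not-yet-ported parts of the application keep using these handles.
cl_command_queue ocl_queue  = q.get();
cl_device_id     ocl_device = q.get_device().get();
cl_context       ocl_ctx    = q.get_context().get();
```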

That’s our first iteration: configure and compile it with the Intel DPC++ Compiler and run it.

Data Parallel C++: How to Select an Intel GPU

The shortcoming of the first iteration is that it always selects the default device, which may or may not be an Intel GPU. To specify an Intel GPU, we need to write a custom device selector:
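A sketch of such a selector. The scoring scheme is our choice; any positive score for the desired device and a negative score for everything else will do:

```cpp
class intel_gpu_selector : public device_selector {
public:
  int operator()(const device& dev) const override {
    // Score Intel GPUs highest; a negative score means "never pick this".
    if (dev.is_gpu() &&
        dev.get_info<info::device::vendor>().find("Intel") !=
            std::string::npos)
      return 100;
    return -1;
  }
};
```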

We use intel_gpu_selector to select an Intel GPU when the user asks for it:
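For example, with use_intel_gpu standing in for whatever flag the argument parser produces:

```cpp
// use_intel_gpu is a hypothetical flag from the command-line parser.
queue q = use_intel_gpu ? queue(intel_gpu_selector{})
                        : queue(default_selector{});
```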

Data Parallel C++: How to Set Kernel Arguments and Launch Kernels

The third iteration of our code uses DPC++ to set kernel arguments and launch kernels. The program is still built, and the kernels are obtained, in the OpenCL™ way. We use cl::sycl::kernel objects to wrap original OpenCL™ kernels. For example:
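A sketch of the wrapping, assuming program is the cl_program built with clBuildProgram and sycl_ctx is the DPC++ context:

```cpp
// Create the kernel the OpenCL way, then wrap it for use with DPC++.
cl_int err;
cl_kernel k = clCreateKernel(program, "gqsort_kernel", &err);
kernel gqsort_kernel(k, sycl_ctx);
```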

We replace a number of clSetKernelArg methods with set_arg DPC++ methods and clEnqueueNDRange calls with parallel_for calls. This example below shows gqsort_kernel, but a lqsort_kernel upgrade is very similar:
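A sketch of the launch; the accessor names are ours, and the real kernel takes more arguments:

```cpp
q.submit([&](handler& cgh) {
  // Mirror the old clSetKernelArg calls, one argument index at a time.
  cgh.set_arg(0, db_acc);
  cgh.set_arg(1, dnb_acc);
  cgh.set_arg(2, blocks_acc);
  // ... remaining arguments ...
  // Replaces clEnqueueNDRangeKernel: global size, then work-group size.
  cgh.parallel_for(nd_range<1>(range<1>(global_size), range<1>(local_size)),
                   gqsort_kernel);
});
```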

Here’s a less verbose style to set all the arguments of the kernel with one set_args call:
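For example (same illustrative names as above):

```cpp
q.submit([&](handler& cgh) {
  // One call binds every kernel argument, in declaration order.
  cgh.set_args(db_acc, dnb_acc, blocks_acc, parents_acc, result_acc);
  cgh.parallel_for(nd_range<1>(range<1>(global_size), range<1>(local_size)),
                   gqsort_kernel);
});
```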

We can also use a less verbose version of the parallel_for:
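One possible shortening, relying on range<1> converting implicitly from an integer:

```cpp
// The explicit range<1>(...) wrappers can be dropped.
cgh.parallel_for(nd_range<1>(global_size, local_size), gqsort_kernel);
```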

Data Parallel C++: How to Create Buffers and Set the Access Mode

We convert OpenCL™ buffers to DPC++ buffers. (The first two wrap memory that was allocated with aligned_alloc and passed into the function by reference. The other three are created from STL vectors.) We use the template keyword in front of the get_access member function for buffers that we pass by reference. Note the different access modes for the various buffers, depending on whether we need read access, write access, or both. We don’t pass buffers directly as kernel arguments; we pass accessors to them:
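A sketch under those assumptions; the access modes and names are illustrative, though block_record, parent_record, and work_record are the record types from the original implementation:

```cpp
template <typename T>
void gqsort(queue& q, T* d, T* dn, size_t n,
            std::vector<block_record<T>>& blocks,
            std::vector<parent_record>&   parents,
            std::vector<work_record<T>>&  result) {
  // Wrap the aligned host allocations passed in by reference ...
  buffer<T> db(d, range<1>(n));
  buffer<T> dnb(dn, range<1>(n));
  // ... and create the rest from STL vectors.
  buffer<block_record<T>> blocksb(blocks.data(), range<1>(blocks.size()));
  buffer<parent_record>   parentsb(parents.data(), range<1>(parents.size()));
  buffer<work_record<T>>  resultb(result.data(), range<1>(result.size()));

  q.submit([&](handler& cgh) {
    // 'template' is required here because db's type depends on T.
    auto db_acc      = db.template get_access<access::mode::read_write>(cgh);
    auto dnb_acc     = dnb.template get_access<access::mode::read_write>(cgh);
    auto blocks_acc  = blocksb.get_access<access::mode::read>(cgh);
    auto parents_acc = parentsb.get_access<access::mode::read_write>(cgh);
    auto result_acc  = resultb.get_access<access::mode::write>(cgh);
    // Accessors, not buffers, are what get bound to the kernel arguments.
    cgh.set_args(db_acc, dnb_acc, blocks_acc, parents_acc, result_acc);
    // ... launch as shown earlier ...
  });
}
```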

Data Parallel C++: How to Query Platform and Device Properties

In OpenCL™, we used the methods clGetPlatformInfo and clGetDeviceInfo to query various platform and device properties. Now we can use get_info<> methods to query the same information. For example:
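A sketch, with dev being the DPC++ device object:

```cpp
// Replaces clGetDeviceInfo(dev, CL_DEVICE_NAME, ...) and friends.
std::cout << "Device: " << dev.get_info<info::device::name>()   << "\n"
          << "Vendor: " << dev.get_info<info::device::vendor>() << "\n";
```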

or query properties with a more complex structure:
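For instance, the maximum work-item sizes come back as a structured id<3> rather than a raw array:

```cpp
id<3> max_wi = dev.get_info<info::device::max_work_item_sizes>();
std::cout << "Max work-item sizes: " << max_wi[0] << " x "
          << max_wi[1] << " x " << max_wi[2] << "\n";
```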

Porting OpenCL™ Kernels to Data Parallel C++, Part 1: gqsort_kernel

So far, we’ve initialized the platform and the device, created the buffers and their accessors and bound them to the kernels, and launched those kernels on the device in a DPC++ way. But we still need to create the kernels in an OpenCL™ way. We use OpenCL™ C and clBuildProgram/clCreateKernel APIs to build the program and create kernels. The OpenCL™ C kernels are stored in a separate file that’s loaded into the program at runtime before being built. We’ll change that, starting with the gqsort_kernel, the simpler of the two kernels.

The DPC++ way of creating kernels is via lambdas or functors. The use of lambdas for kernel creation is typically reserved for smaller kernels. When you have a more complex kernel that uses supporting functions, it’s a good idea to create a functor class. We’re going to create a gqsort_kernel_class functor and make it templated right from the start so that we can sort more than one datatype in the future.

A typical functor class will have a void operator() that will take as a parameter an iteration id (in our case, a one-dimensional nd_item<1> id). The body of the kernel will reside in the void operator(). The functor will also have a constructor that will take global and local accessors, the equivalent of global and local memory pointers for an OpenCL™ kernel. The typical DPC++ functor will have a preamble, with using clauses defining various global and local accessor types. In the case of gqsort_kernel, it will look like this:
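A sketch of that preamble; the exact accessor targets are assumptions based on how the buffers are used:

```cpp
template <typename T>
class gqsort_kernel_class {
  // Shorthand for the accessor flavors used by this kernel.
  using read_accessor =
      accessor<T, 1, access::mode::read, access::target::global_buffer>;
  using read_write_accessor =
      accessor<T, 1, access::mode::read_write, access::target::global_buffer>;
  using local_read_write_accessor =
      accessor<cl_uint, 1, access::mode::read_write, access::target::local>;

public:
  void operator()(nd_item<1> id) {
    // kernel body goes here
  }
  // constructor taking the global and local accessors omitted for brevity
};
```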

The private section of the functor will contain all the global and local accessors used within the body of the void operator(). In our case, it will look like this, with the first five accessors to global buffers and the rest to the local buffers:
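A sketch of that member list. The element types are simplified here; the real code uses block_record<T>, parent_record, and work_record<T> accessors for the last three global buffers:

```cpp
private:
  // Accessors to the five global buffers ...
  read_write_accessor d, dn;
  read_accessor blocks, parents, result;
  // ... followed by the accessors to the local buffers.
  local_read_write_accessor lt, gt, ltsum, gtsum;
```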

gqsort_kernel is a complex kernel that uses supporting structs and two supporting functions: plus_prescan and median, which, in turn, use specialized OpenCL™ functions and extensively use local memory arrays and variables, local and global barriers, and atomics. All these elements must be translated into DPC++.

Let’s start with the functions. We omit the structs, since they’re trivially templatized. The plus_prescan function that’s used to calculate scan sums is relatively simple, so the only change we’ll make to bring it to DPC++ is to make it a template function in preparation for making our sort generic:
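A sketch of the templated version (the original operated on uint pointers in local memory):

```cpp
// Exclusive-scan helper: *a receives the old *b, *b receives the sum.
template <typename T>
inline void plus_prescan(T* a, T* b) {
  T av = *a;
  T bv = *b;
  *a = bv;
  *b = bv + av;
}
```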

The median function is next. We not only need to make it a template function, we also need to replace the OpenCL™ C select function with the DPC++ cl::sycl::select function and rename it median_select to differentiate it from a similar host function:
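One way to write it, as a sketch: the median of three values is the third value clamped to the range of the other two, and cl::sycl::select(a, b, c) returns b when c is true, a otherwise. (For wider element types the comparison result needs a cast, which the select_type_selector section below addresses.)

```cpp
template <typename T>
inline T median_select(T x1, T x2, T x3) {
  T lo = cl::sycl::select(x1, x2, x2 < x1);  // min(x1, x2)
  T hi = cl::sycl::select(x2, x1, x2 < x1);  // max(x1, x2)
  T m  = cl::sycl::select(x3, hi, hi < x3);  // min(x3, hi)
  return cl::sycl::select(m, lo, x3 < lo);   // max(m, lo) = the median
}
```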

In OpenCL™ C, it’s possible to both create local memory variables and arrays inside the body of the kernel and pass them as kernel parameters. But in DPC++, when using functors, we pass local buffer accessors when constructing the functor. In our case, all local memory variables and arrays will hold unsigned integers, so we’ll create a special local_read_write_accessor type:
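A sketch of that type alias:

```cpp
// All local memory in these kernels holds unsigned integers, regardless
// of the element type T being sorted.
using local_read_write_accessor =
    accessor<cl_uint, 1, access::mode::read_write, access::target::local>;
```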

We declare all the local memory variables:
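A sketch inside the command group, cgh being the handler; the sizes are illustrative, with GQSORT_LOCAL_WORKGROUP_SIZE being the work-group size constant from the original OpenCL™ source:

```cpp
// Scalars become size-1 local accessors; arrays get their element counts.
local_read_write_accessor ltsum(range<1>(1), cgh);
local_read_write_accessor gtsum(range<1>(1), cgh);
local_read_write_accessor lt(range<1>(GQSORT_LOCAL_WORKGROUP_SIZE + 1), cgh);
local_read_write_accessor gt(range<1>(GQSORT_LOCAL_WORKGROUP_SIZE + 1), cgh);
```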

We then pass them as parameters, along with global buffer accessors, to our functor constructor. Then the resulting object is passed to the parallel_for:
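A sketch of the launch; blocks_count and the accessor names are illustrative:

```cpp
cgh.parallel_for(
    nd_range<1>(range<1>(blocks_count * GQSORT_LOCAL_WORKGROUP_SIZE),
                range<1>(GQSORT_LOCAL_WORKGROUP_SIZE)),
    // The functor object carries all the accessors into the kernel.
    gqsort_kernel_class<T>(db_acc, dnb_acc, blocks_acc, parents_acc,
                           result_acc, lt, gt, ltsum, gtsum));
```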

Here, DPC++ lacks the simplicity of OpenCL™ C. Next, the get_group_id(0) and get_local_id(0) functions become calls on the nd_item object: id.get_group(0) and id.get_local_id(0).

Local barriers go from barrier(CLK_LOCAL_MEM_FENCE) to id.barrier(access::fence_space::local_space), while combined global and local barriers go from barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE) to id.barrier(access::fence_space::global_and_local).

For atomic operations, DPC++ is not as elegant as OpenCL™ C: what was a concise atomic_add call becomes unwieldy.
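A sketch of the transformation; the variable names follow the original's parent-record bookkeeping but are illustrative:

```cpp
// OpenCL C:
//   uint old = atomic_add(&psum->left, lt_sum);
// DPC++: first wrap the memory location in a cl::sycl::atomic<>,
// then operate on the wrapper.
cl::sycl::atomic<cl_uint, access::address_space::global_space>
    left_atomic(global_ptr<cl_uint>(&psum->left));
cl_uint old = left_atomic.fetch_add(lt_sum);
```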

Note the creation of cl::sycl::atomic<> variables prior to the use of DPC++ atomic operations, which cannot operate on the global or local memory pointers directly. So far, we’ve translated and templatized supporting structs and functions, converting specialized OpenCL™ C functions to DPC++. We’ve also created a template functor class with local accessors and translated barriers and atomics.

Porting OpenCL™ Kernels to Data Parallel C++, Part 2: lqsort_kernel

Translation of lqsort_kernel follows the familiar patterns outlined by the translation of gqsort_kernel: create an lqsort_kernel_class functor, then translate the local memory arrays and variables and the barriers (there are no atomics here). lqsort_kernel also uses supporting functions and structs. In addition to the plus_prescan and median_select used by gqsort_kernel, we have bitonic_sort and sort_threshold, which are considerably more complex and specific to lqsort_kernel. After translation, they become member functions of lqsort_kernel_class. Their signatures change due to barrier use, which in DPC++ requires the nd_item iteration object, and they work on local and global memory pointers, which require special handling.
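The signature changes for both functions might be sketched roughly as follows:

```cpp
// OpenCL C (free functions taking raw local/global pointers):
//   void bitonic_sort(local uint* sn, uint n);
//   void sort_threshold(local uint* data_in, global uint* data_out,
//                       uint start, uint end, local uint* temp, uint sortnum);
// DPC++ (member functions of lqsort_kernel_class<T>; raw pointers become
// local_ptr/global_ptr, and the nd_item is passed in for the barriers):
void bitonic_sort(local_ptr<T> sn, cl_uint n, nd_item<1> id);
void sort_threshold(local_ptr<T> data_in, global_ptr<T> data_out,
                    cl_uint start, cl_uint end, local_ptr<T> temp,
                    cl_uint sortnum, nd_item<1> id);
```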

These functions are translated similarly to gqsort_kernel, with the UINT_MAX macro replaced by std::numeric_limits<T>::max() to handle various data types in the future. When translating lqsort_kernel, pointers to local memory (e.g., local uint* sn;) are replaced with local_ptr<> objects (e.g., local_ptr<uint> sn;). To retrieve the local pointer from the local accessor, we call the get_pointer member function of the accessor:
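For example, with s standing in for one of the local accessors:

```cpp
// get_pointer() yields the local_ptr<> wrapped by the local accessor.
local_ptr<T> sn = s.get_pointer();
```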

local_ptr<> and global_ptr<> objects work with pointer arithmetic, so what previously was d + d_offset, where d was a global pointer, becomes:
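A sketch, with d_acc being the global accessor that wraps the original d buffer:

```cpp
// multi_ptr supports pointer arithmetic, so the offset applies directly.
global_ptr<T> data = d_acc.get_pointer() + d_offset;
```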

We translate local memory variables as accessors of size 1, which means array accesses at index 0 (e.g., gtsum[0]). When we complete the lqsort_kernel translation, we’ve fully transitioned to DPC++, but we still sort only unsigned integers. We’ve done all the prework of templatizing the supporting structs and functions and the functor classes of the two main kernels, and now we’ll enjoy the benefits.

The Power of Data Parallel C++: Templates…And Their Caveats

The real power of DPC++ is the ability to use C++ templates, which enable writing generic code. We want our GPU-Quicksort to be generic: able to sort not only unsigned integers, but also other basic data types (e.g., floats and doubles). In addition to the UINT_MAX to std::numeric_limits<T>::max() change mentioned above, we need an additional modification of the median_select function. cl::sycl::select takes a different type for its third argument, depending on the size of the type of the first two arguments, so we introduce the select_type_selector type traits class:
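A sketch of such a traits class. The mapping reflects what cl::sycl::select accepted on Linux at the time (a same-size unsigned integer); treat the specializations as assumptions:

```cpp
#include <type_traits>

// Maps the sorted element type to the type cl::sycl::select expects
// as its third (comparison) argument: an integer of the same size.
template <typename T>
struct select_type_selector;

template <>
struct select_type_selector<unsigned int> { using type = unsigned int; };

template <>
struct select_type_selector<float> { using type = unsigned int; };

template <>
struct select_type_selector<double> { using type = unsigned long; };

// Sanity checks: the comparison type always matches the element size.
static_assert(sizeof(select_type_selector<float>::type) == sizeof(float), "");
static_assert(sizeof(select_type_selector<double>::type) == sizeof(double), "");
```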

It allows us to convert a Boolean comparison to an appropriate type required by cl::sycl::select; median_select becomes:
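A sketch of the generic version, with each Boolean comparison cast to the integer type select expects:

```cpp
template <typename T>
inline T median_select(T x1, T x2, T x3) {
  using cmp_t = typename select_type_selector<T>::type;
  T lo = cl::sycl::select(x1, x2, static_cast<cmp_t>(x2 < x1)); // min(x1, x2)
  T hi = cl::sycl::select(x2, x1, static_cast<cmp_t>(x2 < x1)); // max(x1, x2)
  T m  = cl::sycl::select(x3, hi, static_cast<cmp_t>(hi < x3)); // min(x3, hi)
  return cl::sycl::select(m, lo, static_cast<cmp_t>(x3 < lo)); // the median
}
```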

To handle additional types, we need more specializations of select_type_selector. Now GPU-Quicksort can sort floats and doubles on the GPU.

Back to Windows…And RHEL*

To demonstrate DPC++ portability, we port the code to Windows and RHEL*. The RHEL port is minimal: we add the Intel imf math library at link time. Windows porting is slightly more complex. Add the following definitions when compiling:

Accounting for the fact that cl::sycl::select for doubles requires unsigned long long type as the third parameter (as opposed to unsigned long on Linux), select_type_selector for doubles becomes:
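A sketch of the platform-conditional specialization:

```cpp
#ifdef _WIN32
// On Windows, unsigned long is 32 bits, so doubles need unsigned long long
// (on Linux, unsigned long is already 64 bits).
template <>
struct select_type_selector<double> { using type = unsigned long long; };
#endif
```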

On Windows, we undefine max and min to prevent the macro definitions from colliding with std::min and std::max. That’s all there is to it. We can sort unsigned integers, floats, and doubles using Intel GPUs on Windows and two Linux flavors.

Get Started Now

We gradually translated GPU-Quicksort from its original OpenCL™ 1.2 into DPC++. At every step along the way, we had a working application. So, when you’re considering bringing DPC++ to your workflow, start small and either add on or fully transition to DPC++ as time allows. Easily mix OpenCL™ and DPC++ in your code base and enjoy the benefits of both. Use legacy OpenCL™ kernels in their original form and enjoy the full power of C++ templates, classes, and lambdas when you’re developing new code in DPC++. Easily port code between Windows and various Linux flavors and choose which platform to develop on. You also have the full power of Intel tools to help you debug, profile, and analyze your DPC++ program.
