# Solving a 2D Heat Equation Using Data Parallel C++ (DPC++)

## A Step-by-Step Case Study Porting a C and MPI Application to DPC++

The heat equation is a problem commonly used in parallel computing tutorials. In fact, we start from one such exercise published by the Partnership for Advanced Computing in Europe (PRACE). The original code1 describes a C and MPI implementation of a 2D heat equation, discretized into a single point stencil (Figure 1). The 2D plane is divided into cells, with each cell being updated every timestep based on the previous values of itself and its four neighbours. A more detailed explanation of the problem can be found on the PRACE repository2.

Figure 1. The 2D heat equation and single point stencil

The default implementation starts with a 2000 x 2000 cell plane that evolves over 500 steps in time (Figure 2). The plane is initialized to a uniform temperature except for a disc in the center that has a different uniform temperature. External edges are fixed at four different temperatures.

Figure 2. Setting up the computation

In this article, readers will learn how we can take code written in standard C/C++, add some simple DPC++ constructs, and run on different parallel processing units. Familiarity with DPC++, SYCL, and oneAPI is assumed.

### Initial Port to DPC++

The original PRACE code features MPI constructs to divide the problem into tiles that are processed across multiple processors and multiple nodes. These are initially removed from the code to focus on the core computation. They will be reintroduced later. Other features, such as regular writing of output images and restart checkpoints, are also removed to simplify the problem. Te code associated with writing images depicting the initial and final planes were kept as a useful functional verification tool.

A further simplification made initially was to update the whole of the plane in a single kernel. The MPI-based code separates the edges and interior so that the interior can be updated while the halo is being copied from other processes. The edge from one tile becomes the halo from a neighboring tile. Once the halo is updated, the edges can be computed, which are dependent upon both the halo and the interior (Figure 3). In this initial port, the edges and interior are treated as one.

Figure 3. Data layout and exchange

In order to use DPC++, two header files need to be included. The first provides support for the DCP++ language, is provided with the DPC++ compiler, and is referenced in main.c, core.c, and heat.h. The second header file is taken from the collection of DPC++ sample programs3 and is included for the exception handler; this header file is only referenced from main.c. We also declare the sycl namespace to simplify SYCL constructs in the code body:

Other changes to main.c involve the main time iteration loop, which is wrapped top and tail with DPC++ code. The following five code segments represent contiguous code but are separated here to aid the description.

### Device and Queue

A default selector is chosen, which means that the runtime will select the target device to run the kernel on. Usually this will result in a GPU being used, if present, or the host CPU otherwise. If we wanted to force a certain device, we could use cpu_selector or gpu_selector instead of default_selector. A queue is then defined based upon the selector and an exception handler which is wrapped in a try-catch block.

### Buffers

Buffers provide containers for data that is present on the target device and are familiar to OpenCL programmers. We take the size of the problem from the data structure found in the original code, adding two to each dimension for the halo (one cell on each side) (Figure 3). global_range is declared as a one-dimensional range based upon our problem size and dimensionality of the original host array. The buffers are then defined, referencing this range and pointers to the host data:

### Queue Submission and Accessors

The time evolution for-loop is unmodified from the original code. For each iteration of the loop, we make a queue submission based upon the queue defined above. Accessors define how the buffers can be accessed on the device. In this case, we simply declare read_write access for both of our buffers. Although only one buffer is read from and one is written to during each loop iteration, we swap the buffers while they remain in context so we cannot declare them as read or write.

### Kernel Execution

We use a parallel_for kernel, meaning that the body of this section will be submitted for every item in our problem (i.e., every cell in the plane, including the halo in this case). A one-dimensional id is defined that will be passed to the kernel identifying the cell being calculated. We could have used a two-dimensional ID, which would aid indexing within the kernel body, but opted for a one-dimensional ID to show a functional port with minimal code changes. The body of the kernel is contained within the function evolve in a separate file, core.c, which is described later. The original code has one call point to the equivalent evolve function and then swaps the pointers within the loop. Here we alternate pointers to our accessors on each loop iteration via separate parallel_for calls to make buffer management simpler:

### Error Handling

Any errors that occur during kernel execution are passed to the host application scope and are handled there with conventional C++ exception handling techniques. Here, we simply rely upon the standard exception handler that is provided in the sample programs:

### Kernel Code

The following code is the complete listing for core.c. It is vastly simplified from the original code since we are not worried about the inter-process communication or separating the interior from the edge calculation. In the original code, as with traditional CPU programming, the evolve function is called once for the entire plane and two nested for loops traverse the x and y dimensions, calculating the update for each cell. Due to the use of our parallel_for call above, the function below will be called for each individual cell. Therefore, there are no for-loops.

The global range used when calling parallel_for included the halo, which is not updated but is needed to calculate the new values for the edges. The if statement is used to omit the halo so that we do not attempt to read memory outside of the buffers, which would result in a segmentation fault. The actual calculation is confined to the last line of code and is very similar to the original code. Various intermediate variables are declared purely to aid readability.

There is a subtle change to the function parameters. The most obvious is the inclusion of the ID, which is needed to identify which cell is being calculated. We also pass in dx2 and dy2 to the kernel rather than dx and dy.dx2 and dy2 were originally calculated with the kernel, but their values remain constant across all cells. Since our new evolve function is called for every cell, calculating these values each time would be an unnecessary overhead. We also need to declare the function as SYCL_EXTERNAL to tell the compiler that this is kernel code, something that is not obvious from the body. The function prototype in the header file is also modified accordingly:

### Compiling the Application and Kernel

The original code uses the mpicc compiler wrapper. To compile for DPC++, we use the dpcpp compiler and modify the Makefile accordingly. However, the files associated with writing out PNG files require a C compiler so we use GCC for this and wrap the function prototype with extern “C” {…}. These files reference libpng4, which can be installed with sudo apt-get install libpng-dev or downloaded and built from source.

### Targeting FPGA

The code and modified Makefile above are appropriate for targeting CPU and GPU. For FPGA we make a few minor changes in the source code and the Makefile.

main.c is modified to select the fpga_selector when FPGA is defined:

For the Makefile, there are just a few extra flags for the compiler and linker. We have also changed the executable name so that it remains separate from the CPU/GPU version.

### Running the Application on the Intel® DevCloud

The Intel® DevCloud5 gives users the opportunity to try oneAPI and DPC++ for free with a preinstalled environment and access to multiple Intel CPU, GPU, and FPGA technologies. The following assumes that access to Intel DevCloud is already available and configured as per the Getting Started Guide6.

Before we can compile and run our code, we need to install libpng. As sudo access is not available on the Intel DevCloud, we do this by compiling from source. Download the source code from Reference 7 and copy to the Intel DevCloud, then perform the following steps:

Then add the following to ~/.bashrc, or run these commands upon any new session. The environment variable LIBPNG_ROOT can then be used to reference the library in the Makefile:

To compile and run our heat equation, we can use interactive sessions on the relevant compute nodes for CPU and GPU using one of the following commands, respectively:

Then, we simply make and run the code before exiting the session:

Because FPGA can take a long time to compile, it’s best to submit this as a batch job, targeting one of the FPGA compile nodes as follows:

It’s also advisable to make clean between CPU/GPU compiles and FPGA or compile in a separate directory to avoid conflicts. The contents of compile_fpga.sh are as follows and the file must be executable:

Note that we have kept a separate Makefile for the FPGA compilation. The status of the job can be checked using the qstat command. Once completed and successful, the kernel can be run on the FPGA using an interactive session, which is launched as follows:

### Functional Verification

At the end of execution, the host application prints out the final value of a specific cell to the terminal as well as outputting a PNG image of the final plane (Figure 4). The specific value provides a quick check against different code iterations and kernel targets. This value is the same across runs on CPU, GPU, and FPGA and matches the value of the original code. The PNG image can be compared using third-party applications and shows no differences across the three technologies or the original code.

### MPI Implementation

Before adding the MPI multi-process support back into the code, we need to think about how to divide the problem between host and device, as well as how to parallelize the operations. A reasonable starting point is to have the host deal with the edge calculations and transfer between processes and have the device calculate the interior. This also aligns with the approach in the original code. During the time iteration loop, the interior can be evolved independently of the edges with a synchronization point prior to the swapping of pointers at the end of each iteration (Figure 5). The two rows can operate in parallel.

Another important consideration is how the problem is decomposed among MPI processes. The original code uses a two-dimensional decomposition, meaning that with four processes, the problem will be divided into four quadrants (2 x 2 tiles). The data transfers between host and device will be complex and comprise many individual PCIe transactions for the left and right edges, if we transfer only the data needed. It would be more efficient to transfer the whole data planes back and forth between each iteration of the interior evolution on the device. This is because one large transfer is faster than many small transfers. However, a better way would be to decompose the problem in one dimension only. This way, we can transfer only the data needed between host and device and those transfers will require just two lots of contiguous memory (top and bottom edges) (Figure 6). Figure 7 shows the relative execution time normalized to the time taken to transfer all data. Note that because the global edges of this problem are fixed, there is no need to transfer left and right edges for the one-dimensional implementation, so the amount of data transferred for 1D and 2D decompositions is the same.

Figure 6. 1D versus 2D data decomposition

We can force a one-dimensional decomposition by simply changing the dimensions passed to the MPI_Dims_create function in setup.c. A zero passed to this function allows the runtime to define the range of each dimension. By using a one as the second value, we force a single column implementation:

The buffer implementation used in our previous implementation transfers data to the device implicitly prior to execution and then back again when the buffers go out of context. To be able to transfer data back and forth within the loop, we need finer control. The Unified Shared Memory (USM) model allows data to be allocated as device, host, or shared and enables familiar C++ constructs to interact with memory. We will use device allocated memory, which enables explicit control over data transfers between host and device.

Taking the original code as the starting point and keeping all the MPI code, the following changes are made. The header and namespace changes are omitted here, since they are the same as before. The following five sections represent contiguous code.

### Device Memory Allocation

The device selector and queue are as per the first implementation. Instead of using buffers we use malloc_device to explicitly define memory on the device, which is then copied over with explicit memcpy functions. The queue needs to be referenced to provide the device context. The wait function waits for the previous memory operations to complete. This is necessary because host execution continues after submitting the requested action to allow parallel operation between host and accelerator. The wait function serves as a synchronization point to ensure that all actions in the queue, which may execute out of order, are complete before host execution continues. In the previous example, the SYCL runtime took care of scheduling operations using buffers.

### Copy Edges to Device

At the beginning of each timestep, the top and bottom edges of each slice are copied from host to device using pointer math to identify the edges, which are one row below and above the top and bottom, respectively, due to the halo. Note that the copy operation is initiated, then the exchange function, and then the wait so that the copy and exchange can operate in parallel.

### Kernel Execution

In this implementation, the device evolves the interior rather than the whole plane and so the function name for the compute is changed accordingly. The other subtle difference is that the device memory pointers are passed in instead of the pointers to buffer accessors. The exchange_finalize function is from the original code, which waits for the MPI-based data exchange to complete before evolving the edges. We then wait for the kernel to complete before copying the edge of the interior back to the host.

### Pointer Swap

The final step for each time iteration is to swap the pointers. On the host this is handled by the swap_fields function from the original code. For the device, we perform a simple pointer swap.

### Completion

To finalize the kernel operation, we copy both planes back to the host, free the previously allocated device memory and catch any exceptions that may have occurred.

### Kernel Code

Kernel code looks very similar to the previous implementation. The only difference is in the if statement, which now excludes both the halo and edges from the top and bottom, but only the halo from the left and right edges.

Since the left and right edges are now evolved using the device kernel, we need to remove this operation from the host CPU in the evolve_edges function. This function comprises four for-loops, one for each edge. The left and right edges are the last two, which we simply comment out.

### Compiling and Running with MPI

We use the Intel dpcpp compiler but add additional flags to include and link in the Intel MPI library:

To run the code, we use mpirun and pass in the number of processes and the executable:

Both the single-point value and PNG comparisons confirm that this implementation functionally matches the previous implementation and the original code. This remains true when run on a single node with multiple processes and across multiple nodes using multiple GPUs.

### Conclusions

With a few basic changes to the code, we have managed to take a standard problem and convert it to a DPC++ implementation using buffers and accessors. We can use the Intel DevCloud to compile code and run on multiple targets including CPU, GPU, and FPGA, with each of those targets producing the same functional results as the original implementation. Finally, we showed how to use USM to provide explicit control of memory transfers between the host and device, necessary to manage the data flow between multiple processes using MPI.

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.