DPC++ Data Management across Multiple Architectures
Part 2: Ordering Execution Using the oneAPI Programming Framework
In Part 1 of our Data Parallel C++ (DPC++) Data Management blog, we looked at data movement and control when targeting cross-architecture systems that include hardware accelerators—such as in the oneAPI programming model. We introduced Unified Shared Memory (USM) and buffers, both of which can help programmers achieve their portability and performance goals with DPC++ (oneAPI’s cross-architecture language). In this blog, we detail how the executions of DPC++ kernels are ordered relative to each other.
And if you missed our first blog and need a primer on DPC++, read Dive into DPC++ for an inside look at the language and its kernel-based approach.
Ordering Execution – Just like Everyday Life
In DPC++, ordering the execution of kernels is important because kernels may execute concurrently on one or multiple hardware accelerators, and kernel code executes asynchronously from the application’s host code. Without properly ordering the executions, one kernel may start operating on data while another kernel is still using it, leading to data corruption and hard-to-find bugs—experiences that are best avoided!
Kernels are asynchronous tasks submitted to a queue for execution on a device, and often must execute in a specific order—just like tasks in everyday life. For example, if you are hungry for dinner, you may first need to go to a grocery store to buy ingredients (task A) and then prepare a dish (task B). When an ordering requirement exists between two tasks, we say that a dependence exists between them. In our example, since you can’t prepare a dish until after buying ingredients, there is a dependence between these two tasks.
For kernels, the dependences are usually driven by the data that a kernel is operating on (reading from or writing to). For example, any data that a kernel reads from to do its work needs to be produced and available on the device before the kernel can execute. These data dependences mean that synchronization and data transfers are required before execution of most kernels. The data transfer tasks may be explicit programmer-specified copy operations or implicit data movements performed by the DPC++ runtime.
You can visualize all your program’s device tasks and the dependences between them as a directed acyclic graph (DAG)—the nodes in the graph are the tasks to execute on a device, and the edges are the dependences between tasks.
Figure 1: Simple Task Graph
In Figure 1, task A (Go to the Store for Ingredients) must execute before tasks B (Make Salad), and C (Make Pasta). Likewise, B and C must execute before task D (Eat!). Because tasks B and C are independent, they do not have a dependence between them. This means the runtime is free to execute them in any order, or even in parallel, so long as task A has finished executing first.
Tasks may have dependences with just a subset of other tasks. When two subsets of tasks are independent, if we only specify the dependences that matter for correctness, then the runtime has maximum freedom to optimize the execution order.
Figure 2: Task Graph with Disjoint Dependences
In Figure 2 we add a few household chores to our evening, in the form of tasks E (Do Laundry) and F (Fold Clothes). Task E must execute before F, but tasks E and F have no dependences with A, B, C, and D. If all of these tasks are submitted to a queue for execution, there are many valid execution orderings. For example, the tasks could be executed purely in-order from A to F, or the execution of tasks E and F could be interleaved within tasks A to D if, say, you want to take a break and do laundry after shopping. In cases where there are no dependences between tasks, such as for tasks B, C, and E, perhaps you could recruit a roommate or family member and execute the tasks in parallel!
There are two different ways you can model the execution of tasks in a queue:
- Strictly in the order of submission, where there are implicit dependencies between tasks constraining the order of executions (not necessarily based on the data accessed by a task); or
- Execution of tasks in any order, where the dependences must be described explicitly, typically based on the data accessed by a task.
Figure 3: ordered_queue Usage
The simplest option in DPC++ is to submit tasks to an in-order queue object, as seen in Figure 3, where each task will wait for the previous tasks to be complete before starting. While in-order queues are intuitive and a good fit for many algorithms, the disadvantage is that tasks in an in-order queue will never execute in parallel, even when there are no true dependences between them. Executing tasks in parallel can help improve the performance of some algorithms, so out-of-order queues are the most commonly used in real applications.
Out-of-Order (OoO) Queues
The other option in DPC++ is to submit tasks to a queue object that is an out-of-order queue, meaning that tasks may execute in any order so long as any dependences between tasks are satisfied. The dependences between tasks must be described explicitly.
Dependences are described via a command group, which includes both the task itself and its dependences. Command groups are typically written as C++ lambdas to keep the code brief and are passed as arguments to the submit() method of a queue object. Broadly speaking, the dependences may be represented as:
- Data dependences related to the task in the command group, defined using accessor objects; or
- Execution dependences on other tasks in the graph, represented by event objects.
Dependences with Events
The event mechanism explicitly adds dependences between two tasks in a graph by making one task depend on the completion of another, where completion is represented by an event object. It is most relevant for tasks that operate on USM allocations, which is referenced through pointers instead of accessors. To explicitly add a dependence between tasks, use the depends_on() method that accepts either a single event or a vector of events. The depends_on() method informs the runtime that the command group requires the specified events to complete before the task may execute. Figure 4 shows an example of how depends_on() orders tasks:
Figure 4: Using Events and depends_on
Figure 4 also shows a helpful technique for debugging: To serialize parts of a graph, you can also synchronize through the host by explicitly calling the wait() method on the event, which forces the runtime to wait for the task to complete before host program execution continues. Waiting on the host can simplify debugging of applications, but it also constraints the asynchronous execution of tasks by coupling them with the host program’s execution, so it is not recommended as a general-purpose solution in high-performance code. Similarly, you could call wait() on a queue object to block execution on the host until all enqueued tasks in the queue have completed.
Most code should not use wait() or similar constructs to order task executions but should rely on either the depends_on() mechanism or accessors (described next), which allow the DPC++ runtime to manage the details of task ordering and sequencing for you.
Implicit Dependences with Accessors
Implicit dependences between tasks in DPC++ are often created from data dependences. Data dependences between tasks take three forms, shown in Figure 5.
Figure 5: Three Forms of Data Dependences
Data dependences are expressed to the runtime through accessors. The order that you create accessors, in the program order, creates an implicit ordering of the tasks using those accessors that the runtime manages for you. An example is shown in Figures 6 and 7.
Figure 6: Read-After-Write
Figure 7: RAW Task Graph
In Figures 6 and 7, three kernels execute: computeB, readA, and computeC, and then the final result is read on the host. The command group for kernel computeB creates two accessors, accA and accB. Kernel computeB reads from buffer A through accA and writes to buffer B through accB. The accessors give the DPC++ runtime the information that it needs to order the executions of the kernels correctly, and on when it needs to move data to specific memories (to be accessible by kernels running on different devices). For example, before readA or computeB start to execute, buffer A must be copied from the host to the device and made available for use by the kernels.
The readA kernel creates a read-only accessor for buffer A, which creates a Read-after-Read (RAR) dependence because kernel readA is submitted after kernel computeB. RAR dependencies do not place extra restrictions on the runtime, and the kernels can therefore execute in any order (including at the same time) depending on what the runtime determines will be most efficient.
Kernel computeC reads buffer B, which is the output of computeB. Since kernel computeC was submitted after kernel computeB, kernel computeC has a RAW data dependence on buffer B, meaning that computeB must finish execution and its output be made available as input to computeC, before computeC begins to execute.
Finally the program also creates a RAW dependence on buffer C between kernel computeC and the host, because the host needs to access buffer C after the kernel finishes. This forces the runtime to copy buffer C back to the host after computeC completes execution, but the key is that the runtime does all of this for you automatically behind the scenes! If you use accessors, data movement and scheduling of kernel executions for correctness are automatic!
Figure 8: Write-After-Read and Write-After-Write
Figure 9: WAR and WAW Task Graph
In Figures 8 and 9, the program executes kernels computeB, rewriteA, and rewriteB. Kernel computeB once again reads from buffer A and writes to buffer B, kernel rewriteA writes to buffer A, and kernel rewriteB writes to buffer B. Kernel rewriteA could execute earlier than kernel computeB because less data needs to be transferred before the kernel is ready, but it has to wait for kernel computeB to finish because there is a WAR dependence on buffer A.
While RAW dependences ensure that data flows in the correct direction, WAR dependences ensure existing values are not overwritten before they are read. The WAW dependence on buffer B functions in a similar way—if there were any reads of buffer B submitted in between kernels computeB and rewriteB, they would result in RAW and WAR dependences that would properly order the tasks. However, there is an implicit dependence between kernel rewriteB and the host in this example because the final data must be written back to the host. The WAW dependence ensures that the final output will be correct.
We know that many DPC++ users are actually OpenCL and SYCL gurus, and we have exciting news to share ahead. We’re working on integrating DPC++ extensions such as USM back into the Khronos standards. We’ll provide more details soon.
Ben Ashbaugh is an Intel software architect focused on parallel programming models for general purpose computation on Intel graphics processors, including oneAPI and DPC++. He has authored numerous OpenCL™ extensions and is an Intel representative within The Khronos Group, where he contributes to the OpenCL, SPIR-V™, and SYCL™ industry standards.
James Brodman is an Intel software engineer researching languages and compilers for parallel programming. He is currently focused on the oneAPI initiative and DPC++, and he has written extensively on programming models for SIMD/vector processing, languages for parallel processing, distributed memory theory & practice, programming multi-core systems, and more.
Mike Kinsner is an Intel software engineer researching and developing parallel programming models for a variety of architectures. He also works on compilers for spatial architectures and is an Intel representative within The Khronos Group where he contributes to the SYCL™ and OpenCL™ industry standards.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure.
Your costs and results may vary.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.