Analyzing Memory and Threading Correctness for GPU-Offloaded Code
Intel® Inspector Makes It Easy to Debug Heterogeneous Parallel Code
Modern workloads are diverse, and so are architectures. No single architecture is best for every workload. Maximizing performance takes a mix of scalar, vector, matrix, and spatial architectures deployed in CPUs, GPUs, FPGAs, and future accelerators. This heterogeneity adds complexity that can be difficult to debug. This article introduces the new features of Intel® Inspector that support the analysis of code that’s offloaded to accelerators.
Intel® Inspector Overview
Memory errors and nondeterministic threading errors are difficult to find without the right tool. Intel Inspector is designed to find these errors. It’s a dynamic memory and threading error debugger for C, C++, DPC++, and Fortran applications that run on Windows or Linux operating systems.
Figure 1 shows the types of problems that Intel Inspector finds:
- Memory errors including leaks, invalid access, and more
- Persistent memory errors such as missing or redundant cache flushes
- Threading errors such as data races and deadlocks
It’s easy to use, reliable, and accurate. No special recompilation is required: you can use your normal debug or production build to catch and debug errors. Intel Inspector can analyze dynamically generated or linked code and inspect third-party libraries, even when source code isn’t available. It can break into the debugger just before an error occurs, and the command-line client makes automated regression analysis possible.
How to Analyze Your Offloaded Code Using Intel Inspector
Correctness analysis becomes more complicated when code is offloaded to an accelerator. Our experiences with DPC++ uncovered the need for a tool to assist in debugging offload issues. The current version of Intel Inspector introduces an important approach called “early interception”: for offloaded code, the tool intercepts some problems at an early stage, before the kernel executes. Tables 1 and 2 list the offload issues that Intel Inspector can detect. Data races on shared data are reported, with the following limitations:
- DPC++ barriers and OpenMP synchronizations are ignored, so the tool can report false positives even when work-items are correctly synchronized.
- Data races are not detected on variables defined in kernel local memory.
The instructions below set up your application to run on your CPU, although some GPU analysis is supported through early interception.
Step 1 is to configure your DPC++ application to run on the host CPU:
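As a minimal sketch of this step, assuming a DPC++ runtime from around the time of this article that reads the SYCL_DEVICE_TYPE environment variable (newer oneAPI runtimes select devices through ONEAPI_DEVICE_SELECTOR instead, so check which one your version honors), the configuration might look like this:

```shell
# Assumption: an older DPC++ runtime that honors SYCL_DEVICE_TYPE.
# Restrict SYCL device selection to the CPU so kernels run on the host processor.
export SYCL_DEVICE_TYPE=CPU

# Assumption: on newer oneAPI runtimes the equivalent setting would be
# the following line instead (left commented out so only one selector is active):
# export ONEAPI_DEVICE_SELECTOR=opencl:cpu

echo "SYCL device type: $SYCL_DEVICE_TYPE"
```

Setting only one selector at a time avoids ambiguity about which variable the runtime actually used.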
Next, configure OpenMP applications to run kernels on the CPU device:
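A hedged sketch of the OpenMP configuration follows. OMP_TARGET_OFFLOAD is defined by the OpenMP specification; LIBOMPTARGET_DEVICETYPE is assumed here to be the variable that Intel's OpenMP offload runtime reads to choose the device type, so confirm it against the documentation for your compiler version:

```shell
# Standard OpenMP variable: fail rather than silently fall back to
# host execution when a target region cannot be offloaded.
export OMP_TARGET_OFFLOAD=MANDATORY

# Assumption: Intel's offload runtime reads this variable to pick the
# device type; "cpu" directs target regions to the CPU device.
export LIBOMPTARGET_DEVICETYPE=cpu

echo "OpenMP offload: $OMP_TARGET_OFFLOAD on $LIBOMPTARGET_DEVICETYPE"
```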
Verify that the application works correctly before running the analysis. Enable code analysis and tracing in the JIT compiler/runtimes:
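The sketch below shows one way this might be done, assuming the OpenCL CPU runtime reads the CL_CONFIG_USE_VTUNE and CL_CONFIG_USE_VECTORIZER variables. Both names are assumptions based on Intel's offload-analysis guidance and should be verified for your runtime version:

```shell
# Assumption: enables the profiling/analysis hooks in the JIT-compiled
# kernel code so that the analysis tool can trace it.
export CL_CONFIG_USE_VTUNE=True

# Assumption: disables kernel vectorization so that reported accesses
# map more directly back to individual work-items in the source.
export CL_CONFIG_USE_VECTORIZER=false

echo "JIT tracing: $CL_CONFIG_USE_VTUNE, vectorizer: $CL_CONFIG_USE_VECTORIZER"
```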
Set up the Intel Inspector environment:
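For illustration only, and assuming a default oneAPI installation under /opt/intel/oneapi (the actual prefix and script name depend on how the tools were installed), the environment could be initialized along these lines:

```shell
# Assumption: default oneAPI install prefix; override ONEAPI_ROOT if yours differs.
ONEAPI_ROOT="${ONEAPI_ROOT:-/opt/intel/oneapi}"

if [ -f "$ONEAPI_ROOT/setvars.sh" ]; then
    # Source (not execute) the script so the variables persist in this shell.
    . "$ONEAPI_ROOT/setvars.sh"
else
    echo "setvars.sh not found under $ONEAPI_ROOT; adjust ONEAPI_ROOT" >&2
fi
```

The script is sourced rather than executed because an executed script cannot modify the calling shell's environment.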
Step 2 is to run an analysis on a small workload using either the GUI (inspxe-gui) or the command-line (inspxe-cl) client. Perform analysis using the command-line client as follows:
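A minimal sketch of a command-line collection is shown below. The application path and result directory are hypothetical placeholders; mi3 is the most thorough memory analysis preset, and ti3 is its threading counterpart:

```shell
APP="${APP:-./my_offload_app}"          # hypothetical application binary
RESULT_DIR="${RESULT_DIR:-./r000mi3}"   # hypothetical result directory

if command -v inspxe-cl >/dev/null 2>&1; then
    # mi3 = "Locate Memory Problems"; substitute ti3
    # ("Locate Deadlocks and Data Races") for a threading analysis.
    inspxe-cl -collect mi3 -result-dir "$RESULT_DIR" -- "$APP"
else
    echo "inspxe-cl not on PATH; set up the Intel Inspector environment first" >&2
fi
```

Keeping the workload small matters because this kind of dynamic analysis slows execution considerably.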
View the results as follows:
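As a sketch, assuming the results were written to a hypothetical directory named r000mi3, a text report of the detected problems could be printed like this:

```shell
RESULT_DIR="${RESULT_DIR:-./r000mi3}"   # hypothetical result directory

if command -v inspxe-cl >/dev/null 2>&1; then
    # Print the list of detected problems from an existing result.
    inspxe-cl -report problems -result-dir "$RESULT_DIR"
else
    echo "inspxe-cl not on PATH; set up the Intel Inspector environment first" >&2
fi
```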
Alternatively, you can view the results in the GUI:
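A sketch of opening an existing result in the GUI follows, assuming inspxe-gui accepts a result directory as its argument and using the same hypothetical directory name as above:

```shell
RESULT_DIR="${RESULT_DIR:-./r000mi3}"   # hypothetical result directory

if command -v inspxe-gui >/dev/null 2>&1; then
    # Assumption: passing a result directory opens that result directly.
    inspxe-gui "$RESULT_DIR"
else
    echo "inspxe-gui not on PATH; set up the Intel Inspector environment first" >&2
fi
```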
You can also launch an analysis and view the results in the GUI (Figures 2 and 3).
Usage Example: Using a Host Pointer on the Device
The following code contains a memory problem:
Launch this code using the Intel Inspector command-line client and view the analysis results in the GUI (Figure 4):
Usage Example: Finding a Data Race
The following code contains a race condition:
Launch this code using the Intel Inspector command-line client and view the analysis results in the GUI (Figure 5):
Table 1. Memory issues that Intel Inspector can detect
Table 2. Intel Inspector can detect data races.
oneAPI provides a standard, simplified programming model that can run seamlessly on the scalar, vector, matrix, and spatial architectures deployed in CPUs and accelerators. It gives users and domain experts the freedom to focus on the code itself rather than on the underlying mechanism that generates the best possible machine instructions. Correctness analysis tools like Intel Inspector provide much-needed assistance in debugging difficult-to-detect threading and memory issues.