Optimizing the Performance of oneAPI Applications
Getting the Most from this Unified, Standards-Based Programming Model
Modern workloads are incredibly diverse—and so are processor architectures. No single architecture is best for every workload. Maximizing performance takes a mix of scalar, vector, matrix, and spatial (SVMS) architectures deployed in CPU, GPU, FPGA, and future accelerators. Intel® oneAPI products will deliver what you need to deploy your applications across SVMS. This set of complementary toolkits—a base kit and specialty add-ons—simplifies programming and helps you improve efficiency and innovation.
The Intel® oneAPI Base Toolkit (beta) includes advanced analysis and debug tools for profiling, design advice, and debugging:
- Intel® VTune™ Profiler (beta) finds performance bottlenecks in CPU, GPU, and FPGA systems.
- Intel® Advisor (beta) provides vectorization, threading, and accelerator offload advice.
- Intel-enhanced GDB* (beta) helps efficiently debug code.
Performance Analysis Tools
This article will focus on Intel Advisor (beta) and Intel VTune Profiler (beta) and the new features they provide as part of the Intel oneAPI Base Toolkit (beta).
Intel® Advisor (Beta)
Intel Advisor (beta) is an extended version of Intel Advisor, a tool for code modernization, programming guidance, and performance estimation that supports the DPC++ language on CPUs and GPUs. It provides codesign, performance modeling, analysis, and characterization features for C, C++, Fortran*, and mixed Python* applications.
Intel Advisor (beta) includes:
- Offload Advisor to help you identify high-impact opportunities to offload to the GPU as well as areas that aren’t useful to offload. You can also project performance speedup on accelerators, estimate offload overhead, and pinpoint accelerator performance bottlenecks.
- Vectorization Advisor to help you identify high-impact, under-optimized loops and see what’s blocking vectorization and where it’s safe to force vectorization.
- Threading Advisor to help you analyze, design, tune, and check threading design options without disrupting your normal development.
- Roofline Analysis to help you visualize performance on both your CPU and GPU and see how close you are to the maximum possible performance.
- Intel® FPGA Add-On for oneAPI Base Toolkit (beta) (optional) to help you program these reconfigurable hardware accelerators to speed specialized, data-centric workloads. (Requires installation of the Intel oneAPI Base Toolkit.)
Use the Offload Advisor command-line feature to design code for efficient offloading to accelerators—even before you have hardware. Estimate code performance and compare it with data transfer costs. No recompilation is required.
The Intel Advisor (beta) GPU performance evaluation (Figure 1) produces upper-bound speedup estimates using a bounds and bottlenecks performance model. It takes measured x86 CPU metrics and application characterization as input and applies an analytical model to estimate execution time and characteristics on a target GPU.
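To make the idea of a bounds-based estimate concrete, here is a deliberately simplified sketch. It is not Intel Advisor’s actual model: it assumes the kernel time on the target is bounded by the slower of compute and memory, and that data transfer is added on top. The struct, the function name, and all parameters are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Result of a simplified, hypothetical offload estimate.
struct OffloadEstimate {
    double gpu_seconds;  // estimated time on the target, including transfer
    double speedup;      // measured CPU time divided by estimated GPU time
    bool profitable;     // true when the estimate beats the CPU time
};

// Simplified bounds-style estimate (an assumption, not Intel Advisor's model):
// the kernel takes at least as long as its slowest bound (compute or memory),
// and host-to-device data transfer is paid in addition.
OffloadEstimate estimate_offload(double cpu_seconds,
                                 double flop,               // floating-point operations
                                 double bytes_accessed,     // memory traffic on the device
                                 double bytes_transferred,  // host <-> device traffic
                                 double gpu_flop_per_s,
                                 double gpu_bytes_per_s,
                                 double link_bytes_per_s) {
    assert(gpu_flop_per_s > 0 && gpu_bytes_per_s > 0 && link_bytes_per_s > 0);
    const double compute_bound = flop / gpu_flop_per_s;
    const double memory_bound = bytes_accessed / gpu_bytes_per_s;
    const double transfer = bytes_transferred / link_bytes_per_s;
    const double gpu_seconds = std::max(compute_bound, memory_bound) + transfer;
    return {gpu_seconds, cpu_seconds / gpu_seconds, gpu_seconds < cpu_seconds};
}
```

With inputs like these, a region whose transfer cost outweighs its kernel speedup comes out as not profitable, which is the kind of trade-off the tool reports.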
The Roofline Analysis feature helps you optimize your CPU or GPU code for compute and memory. Locate bottlenecks and determine performance headroom for each loop or kernel to prioritize which optimizations will deliver the highest performance payoff. (Note that GPU Roofline Analysis is in technical preview.)
Intel® VTune™ Profiler (Beta)
Intel VTune Profiler (beta) is a performance analysis tool for serial and multithreaded applications. It helps you analyze algorithm choices and identify where and how your application can benefit from available hardware resources. Use it to locate or determine:
- The most time-consuming (hot) functions in your application and/or on the whole system
- Sections of code that don’t effectively utilize available processor resources
- The best sections of code to optimize for both sequential and threaded performance
- Synchronization objects that affect the application performance
- Whether, where, and why your application spends time on input/output operations
- Whether your application is CPU- or GPU-bound and how effectively it offloads code to the GPU
- The performance impact of different synchronization methods, different numbers of threads, or different algorithms
- Thread activity and transitions
- Hardware-related issues in your code such as data sharing, cache misses, branch misprediction, and others
The tool also has new features to support GPU analysis:
- VTune GPU Offload Analysis (technical preview)
- GPU Compute/Media Hotspots Analysis (technical preview)
GPU Offload Analysis (Preview)
Use this tool to analyze code execution on the CPU and GPU cores of your platform, correlate CPU and GPU activity, and identify whether your application is GPU- or CPU-bound. The tool infrastructure automatically aligns clocks across all cores in the system so you can analyze some CPU-based workloads together with GPU-based workloads within a unified time domain. This analysis lets you:
- Identify how effectively your application uses DPC++ or OpenCL™ kernels.
- Analyze execution of Intel® Media SDK tasks over time (for Linux targets only).
- Explore GPU usage and analyze a software queue for GPU engines at each moment of time.
GPU Compute/Media Hotspots Analysis (Preview)
Use this tool to analyze the most time-consuming GPU kernels, characterize GPU usage based on GPU hardware metrics, identify performance issues caused by memory latency or inefficient kernel algorithms, and analyze GPU instruction frequency for certain instruction types. The GPU Compute/Media Hotspots analysis allows you to:
- Explore GPU kernels with high GPU utilization, estimate the efficiency of this utilization, and identify possible reasons for stalls or low occupancy.
- Explore the performance of your application per selected GPU metrics over time.
- Analyze the hottest DPC++ or OpenCL™ kernels for inefficient kernel code algorithms or incorrect work item configuration.
Case Study: Using Software Tools to Optimize oneAPI Applications
Now that we know some of the tools and features available in the Intel oneAPI Base Toolkit, let’s work through an example. In this case study, we use Intel VTune Profiler (beta) and Intel Advisor (beta) to optimize an application. We’ll use several Intel Advisor (beta) features, including Roofline Analysis and Offload Advisor, to find the bottlenecks in the application and the regions worth offloading to an accelerator.
Matrix multiplication is a common operation in many applications. Here’s a sample matrix multiplication kernel:
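A minimal sketch of such a kernel in plain C++ follows. It assumes square n×n matrices of float stored row-major in flat vectors. The function name matmul_naive and this layout are illustrative choices, not taken from the original listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive matrix multiplication: c = a * b for square n x n matrices
// stored in row-major order in flat vectors.
void matmul_naive(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& c,
                  std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                // One multiplication and one addition per innermost iteration.
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```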
The algorithm is a triply nested loop with a multiplication and addition in each iteration. Code like this is computationally intensive with many memory accesses. Intel Advisor is ideal for helping you analyze it.
Using Intel Advisor (Beta) to Help Port to GPU
Intel Advisor (beta) has a feature that lets you see the portions of the code that can profitably be offloaded to a GPU. It also can predict the code performance when run on the GPU and lets you experiment based on several criteria. Analyzing DPC++ code with Intel Advisor (beta) requires a two-stage analysis:
Figure 2 shows the resulting report, which compares the measured CPU run time with the predicted run time on the accelerator (in this case, a GPU).
In the Summary section of the report, you can see:
- The original CPU execution time, the predicted execution time on the GPU accelerator, the number of offloaded regions, and the speedup in the “Program metrics” pane.
- What the offloads are bounded by. In our case, the offloads are 99% bounded by the last-level cache (LLC) bandwidth.
- Exact source lines of the top offloaded code regions that will benefit from offloading to the GPU. In our case, there’s only one code region recommended for offload.
- Exact source lines of the “Top non-offloaded” code regions that aren’t recommended for offload for various reasons. In our case, the time spent in the loops is too small to be modeled accurately and one of the loops is outside the code region marked for offloading.
Use this information to rewrite the matrix multiplication kernel in DPC++.
Rewrite the Matrix Multiplication Kernel in DPC++
Intel Advisor (beta) provides the exact source line of the offloaded region, as shown in Figure 3. The tool also recommends loops that don’t need to be offloaded because their compute time is too small to be modeled accurately or they’re outside of a marked region (Figure 4).
Follow these steps to rewrite the matrix multiplication kernel in DPC++ (as shown in the code sample below):
- Select an offload device.
- Declare a device queue.
- Declare some buffers to hold the matrix.
- Submit work to the device queue.
- Execute the matrix multiplication in parallel.
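The five steps can be sketched as follows. This is an illustration, not the article’s original listing. It uses SYCL 2020 spellings (the sycl/sycl.hpp header, sycl::default_selector_v), which may differ from the beta toolkit’s API. It also assumes that SYCL_LANGUAGE_VERSION is predefined when the file is compiled as SYCL, and falls back to a serial host loop otherwise so the sketch still builds with an ordinary C++ compiler. The names matmul_offload and offload_device_name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: SYCL_LANGUAGE_VERSION is predefined when compiling as SYCL
// (for example, DPC++ with -fsycl). Otherwise the host fallback is used.
#if defined(SYCL_LANGUAGE_VERSION)
#include <sycl/sycl.hpp>
#endif

// Reports which device the default selector picks (useful as a sanity check).
std::string offload_device_name() {
#if defined(SYCL_LANGUAGE_VERSION)
    sycl::queue q{sycl::default_selector_v};
    return q.get_device().get_info<sycl::info::device::name>();
#else
    return "host (serial fallback, no SYCL compiler)";
#endif
}

// c = a * b for square n x n row-major matrices.
void matmul_offload(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c,
                    std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
#if defined(SYCL_LANGUAGE_VERSION)
    // Steps 1-2: select an offload device and declare a queue bound to it.
    sycl::queue q{sycl::default_selector_v};
    {
        // Step 3: declare buffers that wrap the host matrices.
        sycl::buffer<float, 2> a_buf(a.data(), sycl::range<2>(n, n));
        sycl::buffer<float, 2> b_buf(b.data(), sycl::range<2>(n, n));
        sycl::buffer<float, 2> c_buf(c.data(), sycl::range<2>(n, n));
        // Step 4: submit work to the device queue.
        q.submit([&](sycl::handler& h) {
            sycl::accessor A(a_buf, h, sycl::read_only);
            sycl::accessor B(b_buf, h, sycl::read_only);
            sycl::accessor C(c_buf, h, sycl::write_only, sycl::no_init);
            // Step 5: one work-item per output element, executed in parallel.
            h.parallel_for(sycl::range<2>(n, n), [=](sycl::id<2> idx) {
                const std::size_t i = idx[0];
                const std::size_t j = idx[1];
                float sum = 0.0f;
                for (std::size_t k = 0; k < n; ++k) {
                    sum += A[i][k] * B[k][j];
                }
                C[i][j] = sum;
            });
        });
    }  // Buffers go out of scope here; results are copied back into c.
#else
    // Host fallback: the same computation as a serial triple loop.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
#endif
}
```

Printing offload_device_name() before running the kernel is a quick way to confirm that the work lands on the intended GPU rather than on a CPU device.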
Optimize GPU Usage with Intel VTune Profiler (beta)
Offload Advisor helped us port our CPU kernel to a GPU, yet our initial implementation is far from optimal. We’ll use the GPU Offload analysis in Intel VTune Profiler (beta) to see how effectively we’re using our GPU (Figure 5). The analysis reports an elapsed time of 2.017 seconds and 100% GPU utilization, and it identifies matrix multiplication as our hotspot.
By switching to the Graphics and Platform tabs, we can see more details. Intel VTune Profiler (beta) shows a synchronized timeline between the CPU and GPU. GPU offload does indicate that our GPU execution units are stalling, as indicated by the dark red bar in the timeline (Figure 6).
Next, we’ll run the Intel VTune Profiler (beta) GPU Hotspots report to try to identify the source of our low GPU utilization and stalls. Click on the Graphics tab in GPU Hotspots and you can see a high-level diagram of your architecture (Figure 7). Notice that we’re not using the shared local memory (SLM) cache. Also notice that we’re moving around 159.02 GB/s in total.
We’ll try two optimization techniques:
- Cache blocking the matrix
- Using local memory
To implement these techniques, we need to break our matrix into tiles and work on them separately in the SLM cache:
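The blocking pattern itself can be illustrated on the host. The sketch below is plain C++, not the DPC++ kernel from the case study: it walks the matrices in 16×16 tiles so that each tile is reused while it is still close to the compute units. In the GPU version, the tile would instead be staged in work-group local memory (SLM). The name matmul_tiled and the constant kTile are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge length; 16 matches the 16x16 tile used in the case study.
constexpr std::size_t kTile = 16;

// Cache-blocked c = a * b for square n x n row-major matrices.
// n does not have to be a multiple of kTile.
void matmul_tiled(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& c,
                  std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += kTile) {
        const std::size_t i_end = std::min(ii + kTile, n);
        for (std::size_t jj = 0; jj < n; jj += kTile) {
            const std::size_t j_end = std::min(jj + kTile, n);
            for (std::size_t kk = 0; kk < n; kk += kTile) {
                const std::size_t k_end = std::min(kk + kTile, n);
                // Multiply one tile of a by one tile of b and accumulate
                // the partial products into the matching tile of c.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```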
The new architecture diagram shows that this is much more efficient (Figure 8). We’re making use of the SLM: 136.35 GB/s for read and 45.45 GB/s for write.
Click on the Platform tab (Figure 9) to see some additional metrics. The 1024×1024 matrix resides in global memory, but we now work on it in local memory one 16×16 tile at a time. The elapsed time for the tiled version is 1.22 s, a 1.64x improvement.
Intel Advisor GPU Roofline
To see if the GPU version of our matrix multiplication kernel is getting the maximum performance from our hardware, we can use the new GPU Roofline feature. Intel Advisor (beta) can generate a Roofline Model for kernels running on Intel GPUs. The Roofline Model offers a very efficient way to characterize your kernels and visualize how far you are from ideal performance.
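For readers new to the model, the roof itself is a simple formula: attainable performance is the smaller of the peak compute rate and the arithmetic intensity multiplied by the memory bandwidth. The sketch below computes that bound. The function names are illustrative, any peak or bandwidth figures passed in are the caller’s assumptions, and the idealized intensity for matrix multiplication assumes each matrix crosses the memory level exactly once, which real kernels rarely achieve.

```cpp
#include <algorithm>
#include <cassert>

// Classic Roofline bound: min(peak compute, arithmetic intensity * bandwidth).
// Units: FLOP/byte, GFLOP/s, and GB/s, giving a result in GFLOP/s.
double roofline_gflops(double ai_flop_per_byte,
                       double peak_gflops,
                       double bandwidth_gb_per_s) {
    assert(ai_flop_per_byte >= 0 && peak_gflops >= 0 && bandwidth_gb_per_s >= 0);
    return std::min(peak_gflops, ai_flop_per_byte * bandwidth_gb_per_s);
}

// Idealized arithmetic intensity of an n x n single-precision matrix
// multiplication: 2*n^3 FLOP over 3*n^2 floats (two inputs read once,
// one output written once). Measured traffic is usually higher.
double matmul_ideal_ai(double n) {
    const double flop = 2.0 * n * n * n;
    const double bytes = 3.0 * n * n * 4.0;  // 4 bytes per float
    return flop / bytes;
}
```

A kernel whose measured dot sits well below this bound for a given memory level has headroom at that level, while a dot that sits on a bandwidth roof is limited by that memory level.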
The Roofline Model on GPU is a tech preview feature that’s not available by default. Here’s a five-step process to enable it:
- First, ensure that you have DPC++ code that runs correctly on the GPU. You can easily check which hardware you are running on by doing something like this:
- Since this is a technical preview, you need to enable GPU profiling by setting the following environment variable: export ADVIXE_EXPERIMENTAL=gpu-profiling.
- Next, run the survey with the --enable-gpu-profiling option: advixe-cl -collect survey --enable-gpu-profiling --project-dir --search-dir src:r= -- ./myapp param1 param2
- Run the tripcount analysis with the --enable-gpu-profiling option: advixe-cl -collect tripcounts --stacks --flop --enable-gpu-profiling --project-dir --search-dir src:r= -- ./myapp param1 param2
- Generate the Roofline Model: advixe-cl --report=roofline --gpu --project-dir --report-output=roofline.html
Once the last step is executed, the file roofline.html will be generated and can be opened in any Web browser (Figure 10).
It’s also possible to display different dots based on which memory subsystem is used for the arithmetic intensity computation (Figure 11).
As you can see from the roofline chart in Figure 12, our L3 dot is very close to the L3 maximum bandwidth. To get more FLOPS, we need to optimize our cache utilization further. A cache-blocking optimization strategy can make better use of memory and should increase our performance. The GTI (traffic between our GPU, GPU uncore [LLC], and main memory) is far from the GTI roofline, so transfer costs between CPU and GPU do not seem to be an issue.
Freedom to Focus
Intel oneAPI products will provide a standard, simplified programming model that can run seamlessly on the scalar, vector, matrix, and spatial architectures deployed in CPUs and accelerators. It will give users the freedom to focus on their code instead of the underlying mechanism that generates the best possible machine instructions.