Uncovering More Tuning Opportunities with Intel® Compiler Optimization Reports
Code Generation, Interprocedural Optimization, Floating-Point Precision, and More
The compiler reports generated by the Intel C, C++, and Fortran compilers provide useful information for optimizing code. Our previous articles in The Parallel Universe (see issues 41 and 42) discussed compiler reports for loop transformations and vectorization. In this article, we’ll cover code generation optimizations, interprocedural optimization (IPO), inlining, data alignment, OpenMP and auto-parallelization, and floating-point precision reports. These reports highlight the code generation done by the compiler and help us get better performance by doing “last mile” code changes and optimizations.
Generating Optimization Reports
For the Intel® C/C++ compiler on Linux and macOS, the -qopt-report[=n] option requests an optimization report. (Use the /Qopt-report[:n] option on Windows.) The optional n sets the level of detail in the report, from 0 (no report) through 5 (detailed report). In this article, we’ll discuss reports generated using -qopt-report=5.
The code generation optimizations section of the compiler report shows how many hardware registers are available and how many are used, along with the number of register spills and fills. The simple C++ vector addition shown in Figure 1 gives the compiler optimization report shown in Figure 2. It shows detailed register usage for the input arguments, global and local variables, register spills, and stack usage.
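As a rough illustration of the kind of routine discussed here, the following is a minimal sketch of a C++ vector addition. It is not the article's Figure 1; the function name vec_add, the file name in the comment, and the use of std::vector are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a simple vector addition: c[i] = a[i] + b[i].
// Assumes a and b hold at least c.size() elements.
// One way to request a detailed report on Linux (file name assumed):
//   icpc -O2 -qopt-report=5 -c vec_add.cpp
void vec_add(const std::vector<double>& a,
             const std::vector<double>& b,
             std::vector<double>& c) {
    const std::size_t n = c.size();
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];  // few live values per iteration, so little register pressure
    }
}
```

A routine this small should need only a handful of registers, so its report would be expected to show few or no spills.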
Registers are the closest memory to the ALUs in a CPU, so reads and writes to registers are fast. For best performance, programmers should ensure that most of the variables in a routine can be accommodated in these registers. However, the number of registers is limited, so the compiler attempts to generate code for optimal allocation of these registers.
If a routine uses more variables than there are available registers, some variables must be stored to the stack and reloaded later. This is known as register spilling. Sometimes it’s unavoidable, but register usage can often be improved by applying loop optimizations such as loop fission. A significant number of spills in the compiler optimization report indicates a performance issue known as high register pressure. Recommended remedies for register pressure include avoiding loop unrolling, applying loop fission, and having the compiler generate scalar plus vector code.
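Loop fission can be sketched as follows. The function names and the arithmetic are hypothetical, chosen only to show the transformation; a loop this small would not actually cause spills, and the benefit appears only when a real loop body keeps many values live at once.

```cpp
#include <cstddef>
#include <vector>

// Fused form: both statements share one loop body, so the operands of
// both computations are live in the same iteration.
// Assumes a, b, x, and y all have the same length.
void update_fused(const std::vector<double>& a, const std::vector<double>& b,
                  std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = a[i] * a[i] + 1.0;
        y[i] = b[i] * b[i] - 1.0;
    }
}

// Fissioned form: the loop is split in two, so each loop body keeps
// fewer values live and needs fewer registers.
void update_fissioned(const std::vector<double>& a, const std::vector<double>& b,
                      std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = a[i] * a[i] + 1.0;
    }
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = b[i] * b[i] - 1.0;
    }
}
```

The two forms compute identical results because the statements are independent. Fission is only valid when splitting the loop does not break a dependence between the statements.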
Take a look at the pseudocode shown in Figure 3. It was adapted from a Lattice Boltzmann Method (LBM) application. The compiler optimization reports with and without IPO are shown in Figures 4 and 5. The compilation units, lbm_file1.c and lbm_file2.c, contain functions func_grids and func2, respectively. func2 invokes func_grids in a time-step loop.
One of the most common optimizations for better compiler code generation and performance is function inlining, but it can normally be done only when the callee (function definition) and caller (invocation) are in a single compilation unit. This isn’t the case for the LBM example in Figure 3. However, the Intel compiler can perform interprocedural optimization across compilation units (the -ipo and /Qipo compiler options on Linux and Windows, respectively). As shown in Figure 4, the call to func_grids prevents vectorization of the loop in func2. With IPO, the compiler can inline the call and then has an opportunity to vectorize the loop (Figure 5).
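The situation can be sketched as below. Only the names func_grids and func2 and the two file names come from the article; the function bodies, parameters, and the relaxation formula are hypothetical and are not the article's Figure 3. The sketch is written in C++ and shown in one block for brevity, with comments marking where each function would live.

```cpp
#include <cstddef>
#include <vector>

// --- would live in lbm_file1.c in the article's layout ---
// Hypothetical body: relax one grid cell toward a target value.
double func_grids(double cell, double target, double omega) {
    return cell + omega * (target - cell);
}

// --- would live in lbm_file2.c in the article's layout ---
// Calls func_grids for every cell in each time step. If the definition of
// func_grids is in another compilation unit, the compiler cannot inline it
// without IPO, and the opaque call blocks vectorization of the inner loop.
void func2(std::vector<double>& grid, double target, double omega, int steps) {
    for (int t = 0; t < steps; ++t) {  // time-step loop
        for (std::size_t i = 0; i < grid.size(); ++i) {
            grid[i] = func_grids(grid[i], target, omega);
        }
    }
}
```

With the two functions in separate files, compiling and linking both with -ipo (or /Qipo) gives the compiler the visibility it needs to inline func_grids into the loop.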
One of the most common compiler optimizations is to inline a called function within the caller. The smaller a function’s size, the more likely it is to be inlined. The compiler generally tries to limit the “code bloat” caused by inlining, but users can override the compiler’s conservative tendencies by changing the options listed in Figure 6.
Floating-Point Model and Precision
Scientific users usually want to maintain high precision in their computations. This is typically accomplished by using 64-bit instead of 32-bit floating-point datatypes. However, using lower precision datatypes when appropriate can improve performance. In addition, compiler optimizations that affect numerical reproducibility or consistency can be disabled using the -fp-model precise option (/fp:precise on Windows).
The code in Figure 7 computes the square norm of a 2D array. If this code is compiled with the precise floating-point model, the compiler is unable to vectorize the inner loop (Figure 8). A more relaxed floating-point model (the default) allows the compiler to do more aggressive optimization, as suggested in the report.
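A minimal sketch of such a computation is shown below. It is not the article's Figure 7; the function name square_norm and the row-major std::vector layout are assumptions. The inner loop is a sum reduction, and vectorizing it changes the order in which the additions are performed, which is why a precise floating-point model rules it out.

```cpp
#include <cstddef>
#include <vector>

// Sum of the squares of all elements of a rows x cols array stored in
// row-major order. Assumes a.size() >= rows * cols.
double square_norm(const std::vector<double>& a,
                   std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            const double v = a[i * cols + j];
            sum += v * v;  // reduction: vectorizing reassociates these additions
        }
    }
    return sum;
}
```

Because floating-point addition is not associative, the vectorized sum can differ from the sequential one in the last bits. Whether that difference matters depends on the application.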
OpenMP and Auto-Parallelization
Another speedup opportunity on modern processors is parallelism. The Intel compilers can parallelize a loop when the compiler can determine that it is safe (auto-parallelization) or when the parallelism is expressed using OpenMP (Figure 9). The compiler optimization report shows which loops are parallelized when the OpenMP (-qopenmp and /Qopenmp on Linux and Windows, respectively) and auto-parallelization (-parallel and /Qparallel on Linux and Windows, respectively) options are used (Figure 10).
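Expressing parallelism with OpenMP can look like the following minimal sketch. It is not the article's Figure 9; the function name scale and its arguments are hypothetical.

```cpp
#include <vector>

// Multiplies every element of x by factor. The iterations are independent,
// so the loop can be divided among threads.
// The pragma takes effect only when OpenMP is enabled (for example with
// -qopenmp); otherwise it is ignored and the loop runs serially.
void scale(std::vector<double>& x, double factor) {
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        x[static_cast<std::size_t>(i)] *= factor;
    }
}
```

A signed loop index is used because older OpenMP versions require one. The result is the same with or without OpenMP enabled, since no iteration reads data written by another.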
The Intel compilers provide a rich set of features, performance optimizations, and support for the latest language standards. Users are encouraged to try the latest compiler and experience the application performance improvements that are possible by changing a few compiler options. The compiler reports discussed in this and previous articles help you understand what the compiler is doing. The examples shown in these articles help drive this point home.