Delving into the Mysteries of OpenMP Vectorization Support
How to Use the OpenMP SIMD Directive to Tell the Compiler What Code to Vectorize
OpenMP was created over 20 years ago to provide a standard for compiler-directed threading that’s easier to use and understand than an explicit threading library like pthreads. Many of the details about creating and destroying, managing, and assigning computations to threads was abstracted away into compiler directives. Modern CPUs have vector registers that allow the same instruction to be executed on multiple pieces of data. Such instruction-level parallelism greatly improves performance and power efficiency. But, as with thread-level parallelism, the compiler sometimes needs help getting it right. The SIMD (single-instruction, multiple data) directive and clauses were added to OpenMP to give programmers a way to tell the compiler what computations can be vectorized.
Simple Vectorization Example
The code in Figure 1 is used to compute the value of π. This is done by calculating the area under the curve of the function 4.0 / (1.0 + x2) for x between 0 and 1. The general process used in the code is known as numerical integration or Riemann sums.
If you think back to your linear algebra course, you should recall the midpoint rule of estimating the area under the curve between fixed values on the x-axis. The interval is divided into some number of equal-sized partitions. At the midpoint of each partition, the value of the function is computed. The width of each partition is multiplied by the functional value of the midpoint (or height) to yield the area of a rectangle mostly under the curve. The sum of the areas of all the rectangles across the interval will yield a value that’s close to the actual numerical answer.
Depending on the chosen width, there will be some parts of the rectangle that are outside the curve, and some area under the curve that’s not within a rectangle. However, as the width decreases, or as the number of rectangles increases, the more precise the estimate will be.
Notice that the loop in Figure 1 generates a set of values, one at a time, through the loop iterator, i. The x variable holds the midpoint value, which is used to compute the functional value (the height of the rectangle at the midpoint). This is to the running total sum. Once the loop is complete, the width of all the rectangles, which is fixed at width, is multiplied by the sum of heights to compute the area. (The actual area multiplication could have been done within the loop before adding the value to sum, but, as an optimization, it was pulled out of the loop and performed once rather than hundreds of thousands or millions of times.)
Each iteration of the loop computing a rectangle’s area is independent. Thus, the loop could be threaded with OpenMP and the partial answers generated within each thread summed together to arrive at an answer. But that’s a topic for another article. The aim of this article is to show how this loop can be vectorized.
The loop body is performing the same computations over and over again on a range of values generated by each loop iteration. These are an addition and multiplication to compute x and a multiplication, addition, and division added to sum. If sets of those values could be loaded into a vector register, those computations could be executed in fewer instruction cycles. Just for the sake of oversimplification, a back-of-the-envelope calculation shows each loop iteration above performing six arithmetic operations 10 million times. By packing four or eight values into one register and performing the arithmetic operation on all of those values in a single instruction, the number of instructions executed would go from 60 million scalar instructions to 15 million or 7.5 million vector instructions.
Simple Use of OpenMP SIMD Directives
Since vector registers need to be populated with multiple values, the basic OpenMP SIMD directive omp simd is used to vectorize loops that are “generating” multiple values and performing the same computations on each. Figure 2 shows the loop from Figure 1 with the added OpenMP directive. This informs the compiler that the computations within the loop can be converted from scalar to vectorize operations.
Each num_rects iteration can be “bundled” together (each with the corresponding values of the loop iterator, i) into a logical iteration where the loop body’s operations are carried out on each of the bundled iterations through supported SIMD hardware. The OpenMP standard calls these bundles of iterations a “SIMD chunk.” The number of serial iterations that are placed into a SIMD chunk will depend on:
- The hardware and instruction set support for vector operations at execution
- The size of the values being operated on
- Other implementation-specific factors
Table 1 shows some timing results for the two versions of the Riemann sums code. There’s an obvious performance advantage to vectorizing this code.
Table 1. Timing results of serial and vectorized Riemann sums code
There are several clauses we can add to the SIMD directive. If you’re familiar with OpenMP (and as seen in the example above), you can use the private and reduction clauses as needed. There are other, more specific, clauses used with the SIMD directive that can give better hints to the compiler about memory layout/access patterns that would produce better code or overcome some of the compiler’s conservative tendencies (e.g., the safelen clause discussed below). This article doesn’t go into all of the available clauses. Consult the OpenMP specification for complete details.
Advantages of the Intel® Compiler
Before going into more details about OpenMP SIMD support, a confession about the results presented in Table 1 is in order. The Intel® Compiler has been improved over the years and benefits from intimate knowledge of the vector hardware changes and currently supported vector instructions in Intel CPUs—so much so that the loop in Figure 2 really doesn’t need the hints provided by the OpenMP SIMD directive to recognize the vector possibilities and to vectorize the code. The confession is that the serial code was compiled with optimization disabled (via the -O0 flag) to prevent the compiler from automatically vectorizing or parallelizing the loop.
The compiler optimization report provides insights into what the compiler finds on its own with regard to vector and parallel optimizations. Plus, for those loops we assume can be vectorized, we can also use the generated optimization reports to determine why the compiler decided not to vectorize. Then, we can reexamine the code and consider transformations to allow vectorization. This could be refactoring the code to better order the operations or adding OpenMP SIMD directives and clauses to assure the compiler that vectorization is safe.
As a quick example, Figure 3 shows the compilation command on the original serial code (Figure 1) to determine what the compiler sees with respect to vectorization.
This will generate a pi.optrpt output file containing the information shown in Figure 4.
In the actual code that was compiled, line 16 is the for-loop and line 17 is the assignment of x. (I tried casting the iteration variable, i, to double but the emulation comment remained.) Thus, the Intel® compiler can identify code that can be safely vectorized without any programmer intervention.
If you suspect that a loop should be vectorizable, you can try adding OpenMP SIMD directives, recompiling with the -fopenmp flag, and looking at the console output from the compiler. Even with SIMD directives, if the compiler is unable to safely vectorize the loop, it will give a warning message along the lines of Figure 5.
At this point, use the -qopt-report option to request an optimization report from the compiler to determine why it was unable to vectorize the code. With this information, you can reevaluate the code with an eye toward:
- Refactoring it to remove the impediment
- Adding some clauses to the directive that assure the compiler that vectorization is saf
- Examining the loop for other possible optimizations
[Editor’s note: See Making the Most of Intel® Compiler Optimization Reports in this issue for a more in-depth discussion of compiler optimization reports and the information they contain.]
The safelen Clause
Now let’s consider the function shown in Figure 6.
This code is taking two floating point arrays (A, B), some scaling factor (scale), and start and end (s, e) positions of the first array to be updated by adding a scaled value from a corresponding element of the second array. This is a floating-point, multiply-add operation that’s typically vectorizable if there’s no overlap between the A and B arrays.
If B is part of A, but a part that has higher index values (i.e., “further” down in the A array than the current element being computed), the code has an anti-dependency, or a write-after-read dependency. For example, if B is A offset by two elements, consider this (size 4) SIMD chunk of the inner loop above:
There’s no problem with vectorization because the current values of A[k+2] to A[k+5] are copied into a vector register, multiplied by scale, then added to the current values of A[k] to A[k+3] (also loaded into a vector register).
The third case, where B is A offset by a negative value, the code has a true dependency, or read-after-write dependency, that can cause problems if vectorized. Consider the case of a -2 offset. The SIMD chunk shown above would unwind to:
It should be obvious that there’s a problem reading the original values of A[k] and A[k+1] before the updated (correct) values have been written into them. This is where knowing the safe minimum length of SIMD chunks is given to the compiler through the safelen clause, if the minimum value of the offset is known. If the gap between elements will never be less than, say, 12 between A and B (whenever those two arrays are actually part of the same array), the original function given in Figure 6 can be vectorized with OpenMP, as shown in Figure 7. The compiler now has the assurance to construct vector code that will be safe for vector lengths of 12 elements or less.
Note that using the Intel compiler on the original scale() function (Figure 6) will generate a vectorized version of the code. This is because there’s a loop contained in the code and it appears to be safe. However, since there’s a chance there could be a true dependency, the compiler also generates the serial version of the loop. At runtime, the addresses of the A and B parameters are compared, and the correct choice of compiled code will be executed.
Return of the Simple Example
As I mentioned at the end of the previous section, the Intel compiler will generate vectorized code when it knows it’s safe to do so. Otherwise, it will generate serial code. But, what about a loop that contains a call to a user-defined function? If that function isn’t inlined, there’s no way for the compiler to know if the loop can be vectorized safely.
Figures 8 and 9 show a restructured version of the Riemann sums code that’s been split into two files. Figure 8, pi_func.cc, is the computation of the functional value at the ith midpoint within the bounds of the curve for which we’re computing the area. Figure 9, pi_driver.cc, is the main body of the Riemann sums driver code.
One advantage of constructing the code this way is the ability to substitute almost any computational function, which will make the driver code a more general Riemann sums algorithm. The print statement would need to be changed and the function name updated to something more generic; then, whichever computational function is desired for processing is simply linked to the driver.
Without a loop in the pi_func.cc file (Figure 8), the compiler is unable to vectorize that computation. With the user-defined pi_func() call in the loop (Figure 9), the compiler has no idea whether the code can be vectorized. In fact, compiling the pi_func.cc file with the report-generating options for vectorization will give an empty report.
The declare simd directive can be used to tell the compiler to vectorize code that the programmer knows is safe. The first step is to add the pragma shown in Figure 10 to the function pi_func().
This is also done at the extern declaration in the driver code file (Figure 11). Then, to complete the process, the loop containing the call to pi_func() needs to be denoted as vectorizable in much the way it was in the original example.
The compilation process of the pi_func() code will create a vectorized version of the function, say vec_pi_func(). Within the driver function, the loop body will be altered to call a vectorized version of the function vec_pi_func(). Essentially, if the vector length supported by the hardware and the data style was four, the four calls of pi_func() with parameter values i, i+1, i+2, and i+3 would be substituted for a single call to the vectorized version, vec_pi_func(), with the four values copied into a vector register for vector execution.
Table 2 shows some timing results from running the two versions of the separated Riemann sums code with the function value calculation set off in a separate file. The vectorized version makes use of the declare simd clause for vectorization. The vectorized version in this example is demonstrably faster than the serial code, though not as much as the original version shown at the outset.
Table 2. Execution times for serial and vectorized compilations of Riemann Sums code with userdefined function
Reducing Execution Time
For those computations that can be executed at the same time, a vectorized solution, on hardware that supports such execution, will reduce execution time by actually making multiple computations within the same machine cycles. This even applies to parallelized code. (Try this exercise: Take the original Riemann sums code given [Figure 1] and make the for-loop both threaded and vectorized.)
In straightforward cases, the Intel compiler will be able to automatically vectorize loops of computations at the appropriate compiler optimization levels. The optimization reports from the Intel compiler will give you information about what was vectorized and what wasn’t (and why it wasn’t vectorized). In the latter case, if you believe the compiler has been too conservative, you can restructure the code or add OpenMP SIMD directives and clauses to give the compiler hints about what is safe to vectorize. For the full range of SIMD pragmas and clauses, consult the OpenMP documentation for the version supported by your compiler.
- All compilation and execution were done on the Intel® DevCloud.
- The Intel compiler version used was 19.1 20200306.
- Executions were done in an interactive session on Intel® Xeon® Platinum processors with 384 GB memory. The session was launched with the following command: qsub -I -l nodes=1:plat8153:ppn=2 -d.