How Effective is Your Vectorization?

Gain Insights into How Well Your Application is Vectorized Using Intel® Advisor

Determining how well your application is vectorized is crucial to getting the best performance on your system. In this article, we’ll show how to pinpoint vectorization issues, see how well you’re using your hardware, and optimize performance using Intel® Advisor, which is available in a free, standalone version and as part of both Intel® Parallel Studio XE and Intel® System Studio.

Intel Advisor helps you to see:

  • Which loops are vectorized
  • Data types, vector widths, and instruction sets (e.g., AVX-512, AVX2)
  • How many floating-point and/or integer operations are executed
  • How many instructions were devoted to computation and how many to memory operations
  • Your register utilization
  • How to improve your vectorization
  • And much more

Getting Great Performance

To get top performance out of your application, you need information on how well you’re using all the resources of the system. Intel Advisor’s new and improved summary view (Figure 1) gives you an indication of how well the application is performing as a whole.

Figure 1 – Intel Advisor summary view

You can see the vectorization instruction sets used and some useful performance metrics. This view now includes a program characteristics section, which compares your application’s performance to the peak performance obtainable on your system. In Figure 1, notice that the application is using several different instruction sets, which is something we should investigate. Also notice that the program is getting a vectorization efficiency of just 42%. Where did we lose the other 58%? We can drill down to investigate.
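To make an efficiency figure like this concrete, one common reading is the speedup achieved over scalar code divided by the number of vector lanes available. The C sketch below assumes that definition, which may differ from how Intel Advisor estimates the value. The helper names vector_lanes and vector_efficiency are hypothetical and are not part of Intel Advisor.

```c
#include <stddef.h>

/* Number of elements that fit in one vector register.
 * register_bits: 128, 256, or 512; element_bytes: e.g. sizeof(float). */
size_t vector_lanes(size_t register_bits, size_t element_bytes) {
    return register_bits / (8 * element_bytes);
}

/* Assumed definition (not taken from the article):
 * efficiency = speedup over scalar code / number of vector lanes. */
double vector_efficiency(double speedup, size_t lanes) {
    return speedup / (double)lanes;
}
```

Under this assumption, a single-precision loop on 512-bit registers has 16 lanes, so an efficiency of 42% would correspond to a speedup of roughly 6.7x instead of the ideal 16x.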

Drilling Down

You can get more detail in the survey and roofline tab (Figure 2). The survey view gives details on a loop-by-loop basis. Focus on the loops where you’re spending the most time, and try to get these loops to vectorize as efficiently as possible. Intel Advisor highlights whether the loop is vectorized and its efficiency. If the compiler wasn’t able to vectorize the loop, Intel Advisor can tell you why. The performance issues column can give you clues as to why efficiency is poor.

Figure 2 – Survey and roofline tab

Instruction Set Analysis

Instruction set analysis (Figure 3) takes a deep dive into what the compiler did to vectorize your code. It shows the:

  • Vectorization instruction set used
  • Vector widths
  • Data type being operated on

The traits column generally indicates the memory manipulation the compiler had to do to fit your data structure into a vector. These memory manipulations can be indicators of poor efficiency.
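As an illustration of the kind of data layout that forces extra memory manipulation, the C sketch below contrasts an array of structures with a separate array per field. The type and function names are hypothetical, and what a given compiler actually emits (and what Intel Advisor reports as traits) will vary.

```c
#include <stddef.h>

/* Array of structures: the x, y, z of one point sit side by side, so
 * consecutive x values are three floats apart (strided access). To fill
 * a vector with x values, the compiler typically needs gathers or
 * shuffles. */
typedef struct { float x, y, z; } PointAoS;

void scale_x_aos(PointAoS *points, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        points[i].x *= k;
    }
}

/* Structure of arrays: all x values are contiguous (unit stride), so a
 * vector can be filled with a single plain load. */
void scale_x_soa(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        x[i] *= k;
    }
}
```

Both functions compute the same values. Only the memory layout differs, and that difference is what tends to show up as data rearrangement in the vectorized code.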

Figure 3 – Instruction set analysis

In our example application, the main loop is using Intel® AVX-512, but the vector widths are only 128 and 256 bits, so the full 512-bit ZMM registers aren’t being used. Intel Advisor also gives you a warning message if your application seems to be underperforming, and offers tuning advice (Figure 4).

Figure 4 – Warning message

Recompiling to enable the ZMM registers yields the instruction set analysis in Figure 5. Most of our loops now use the full 512 bits of the vector registers. In our example, using the ZMM registers improved performance. However, this isn’t always the case; it depends on the application.
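The article doesn’t say which compiler options were used for the recompile. As a sketch only, the loop below is a simple candidate for 512-bit vectorization, and the build lines in the comment show options that, to our knowledge, request wider vectors on common compilers. The file name saxpy.c and the function name are hypothetical, and the exact options for your toolchain should be checked against its documentation.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for single-precision data.
 *
 * Example builds (assumptions, not taken from the article):
 *   Intel compiler: icx -O2 -xCORE-AVX512 -qopt-zmm-usage=high -c saxpy.c
 *   GCC:            gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -c saxpy.c
 *
 * restrict tells the compiler that x and y do not overlap. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

After a rebuild like this, rerunning the analysis lets you compare vector widths and timings against the earlier run to confirm whether the wider registers actually helped.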

Figure 5 – Instruction set analysis after recompiling

Using the Middle Part of the Intel Advisor GUI

The tabs in the middle of the Intel Advisor GUI contain a wealth of program information (Figure 6).

Figure 6 – Intel Advisor GUI tabs

The recommendations tab is a great way to get tips to improve performance (Figure 7). For instance, if a loop didn’t vectorize, the recommendations tab can tell you why, and it provides code examples showing how to fix the issue.
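One frequent reason a loop is not vectorized, or is vectorized only behind runtime checks, is that the compiler can’t prove the pointers involved don’t overlap. The C sketch below is a generic illustration of that issue and one possible fix, not a reproduction of Intel Advisor’s own advice. The function names are hypothetical.

```c
#include <stddef.h>

/* The compiler must assume that out may overlap a or b, so a store to
 * out[i] could change a later a[i + 1] or b[i + 1]. It may refuse to
 * vectorize, or add a runtime overlap check before the vector code. */
void add_may_alias(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* restrict (C99) is a promise from the programmer that the three arrays
 * do not overlap, which removes the assumed dependency. Passing
 * overlapping arrays to this version is undefined behavior. */
void add_no_alias(float *restrict out, const float *restrict a,
                  const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

For non-overlapping inputs, both versions produce identical results. The second simply gives the compiler more freedom to vectorize.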

Figure 7 – Intel Advisor recommendations tab

Code Analytics

The code analytics tab (Figure 8) gives details about what’s happening in a loop. You can see your performance at a high level or get statistics for all operations and an instruction mix summary.

Figure 8 – Intel Advisor code analytics tab

Statistics for All Operations

You can get statistics for all operations, including floating-point (FLOPS), integer (INTOP), or mixed (INT+FLOAT) operations (Figure 9). This gives you a detailed view of some key performance metrics, showing how many operations are executing per second. This view also gives you metrics on how well you’re using the memory hierarchy in this loop.
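To show how metrics of this kind relate to a loop, the C sketch below computes a rate in GFLOPS and an arithmetic intensity (floating-point operations per byte moved, the quantity used on a roofline chart) by hand. The helper names are hypothetical, and the per-element counts are a simplified assumption that ignores caching. Intel Advisor measures these values for you.

```c
/* Billions of floating-point operations per second. */
double gflops(double flop_count, double seconds) {
    return flop_count / seconds / 1e9;
}

/* Floating-point operations performed per byte read or written. */
double arithmetic_intensity(double flop_count, double bytes_moved) {
    return flop_count / bytes_moved;
}

/* For y[i] = a * x[i] + y[i] in single precision, each element costs
 * 2 operations (one multiply, one add) and moves 3 floats (load x[i],
 * load y[i], store y[i]), assuming nothing stays in cache. */
double saxpy_intensity(void) {
    return arithmetic_intensity(2.0, 3.0 * sizeof(float));
}
```

With these assumptions the loop performs about 0.17 operations per byte, which suggests it is limited by memory bandwidth more than by compute throughput.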

Figure 9 – Statistics for all operations

How Many Operations Are You Executing?

What are the types of instructions in your loop? Are they compute- or memory-based? Intel Advisor can answer these questions and give you both static and dynamic instruction counts, along with the static instruction mix summary (Figure 10). You get the percentage of each instruction type you’re executing, so you can see if you’re really using the newest instructions where you should be.

Figure 10 – Static instruction mix summary

Optimizing Vectorization

It’s crucial to optimize the vectorization of your program. Understanding how well your program is vectorized by using a tool like Intel Advisor can help you make sure you’re getting the most out of your hardware.
