# Speeding Algebra Computations with Intel® Math Kernel Library Vectorized Compact Matrix Functions

## Maximizing the Performance Benefits of the Compact Data Layout

Many high-performance computing applications depend on matrix computations performed on large groups of very small matrices. The latest version of **Intel® Math Kernel Library (Intel® MKL)** provides new compact functions that include vectorization-based optimizations for problems of this type.

Intel MKL Compact functions rely on true SIMD (single instruction, multiple data) matrix computations, in which subgroups of matrices are operated on by kernels that abstractly appear as scalar kernels, while registers are filled by cross-matrix vectorization. These functions provide significant performance benefits compared to batched techniques, which exploit multithreading but rely on kernels written for standard data formats.

(Find more detailed information about Intel MKL Batch general matrix-matrix multiplication **here**.) Besides the performance benefits that result from efficient vectorization, Intel MKL Compact functions can benefit from the same parallelization principles as batched techniques because calculations on subgroups can be threaded, offering multiple levels of parallelism to the user.

The latest version of Intel MKL has added six compact functions:

- General matrix-matrix multiply
- Triangular matrix equation solve
- LU factorization (without pivoting)
- Inverse calculation (from LU without pivoting)
- Cholesky factorization
- QR factorization

For ease of use, Intel has also added service functions to facilitate the packing and unpacking of groups of matrices into this format. (See the **Intel MKL Developer Reference** for more detailed information about the Compact API in Intel MKL 2018.)

### Compact Format

Compact functions operate on matrices that are packed into a contiguous segment of memory in an interleaved data layout, called Compact format^{1}. In Compact format, matrices are organized into packs of length V, where V is related to the register length of the underlying architecture and the size of the matrix elements. Each pack is a 3D tensor, with the matrix index incrementing fastest. These Compact packs are then loaded into registers and operated on using SIMD instructions.

**Figure 1** shows the packing of a set of four 3×3, real-precision matrices. The pack length for this example is V=2, resulting in two compact packs.

For complex-precision matrices, the real and imaginary parts are packed separately. Real and imaginary packs alternate in memory; thus, the number of packs for a problem involving complex matrices is twice the number of packs for the same problem involving real matrices. Most arithmetic operations on complex elements can be expressed with register type variables without extra operations (i.e., shuffle operations), making this packing format suitable for complex precision Compact functions.

The pack length V is the quotient of the SIMD vector bit length and the bit size of a matrix element (or the bit size of the corresponding real type, in the case of complex precisions). For example, for double-precision matrices on an architecture with a SIMD vector length of 256 bits, V=4, allowing Compact kernels to operate on four matrices simultaneously through vectorized instructions.

Compact kernels operate on separate packs. If the number of matrices to be packed is not evenly divisible by the pack length, the last pack will only be partially filled, and cannot be processed by intrinsic-based kernels. For optimal performance, the last pack should be padded with additional data so that it’s fully packed. To avoid possible numerical issues such as division by zero, the identity matrix is used as padding, since the identity matrix will not produce numerical errors when operated on by any of the Intel MKL Compact functions. **Figure 2 **illustrates the padding process for three matrices when the pack length is equal to 2.

**Figure 3** shows a simple, Compact version of a matrix-matrix multiplication performing the operation C=A*B for a set of four 3×3, real-precision matrices. Generic (or batched) functions require four matrix-matrix

multiplications to be performed for a problem of this type.

Assuming that the matrices have been packed into Compact format using a pack length of V=2, the Compact version of this problem involves two matrix-matrix multiplications, as illustrated in **Figure 4**.

The elements of the matrices involved in these two multiplications are vectors of length V, which are loaded into registers and operated on as if they were a scalar element in an ordinary matrix-matrix multiplication.

### Compact API

Before calling a Compact function, the input data must be packed in Compact format. After execution, the output data should be unpacked from this Compact format, unless another Compact function will be called immediately following the first. **Figure 5** demonstrates the standard call flow when using Intel MKL Compact

functions (replace ? with the desired single-letter precision designation: s, d, c, z).

Intel MKL provides service functions that facilitate the process of packing and unpacking matrices into and out of Compact format. Two service functions, mkl_get_format_compact **and** mkl_?get_size_compact, will calculate the optimal format for the underlying architecture and the size of the necessary Compact buffer to store the Compact arrays. Users will pass the returned values of these functions as parameters for the packing and unpacking functions mkl_?gepack_compact **and** mkl_?geunpack_compact.

Packing and unpacking matrices to and from Compact format adds additional overhead to calculations. Compact functions offer the greatest performance benefit when the user calls multiple Compact functions on the packed matrices, packing the matrices into Compact format before the first Compact function call and unpacking them after the last.

Intel MKL Compact functions are optimized for 128-, 256-, and 512-bit SIMD registers. Compact packs are processed individually by Compact kernels. There are two types of Compact kernels: reference and intrinsic- based. Reference kernels are implemented using C, without specific CPU optimizations, whereas intrinsic-based kernels are optimized for a specific instruction set based on vector intrinsics. In Intel MKL Compact functions, reference kernels are called in two cases:

**The architecture**does not support SSE2 (Intel® Streaming SIMD Extensions 2) or later.**The native format**for the architecture is not compatible with the format of packed matrices.

In both cases, Compact functions will perform correct computations, but performance of the reference kernels may be lower than what one would expect from intrinsic-based kernels.

Calculations on separate Compact packs are inherently independent. Thus, Compact functions are easily threaded. Intel MKL Compact functions implement this naïve form of parallelism behind the scenes, offering improved performance results when calling Compact functions for multiple threads. **Figure 6** compares two approaches for calculating the matrix inverse of many matrices. On the left, calculations are parallelized through an OpenMP* loop over all matrices. The code on the right takes advantage of vectorization and parallelism within the Compact API.

/* OpenMP loop */ #pragma omp parallel for for (i = 0; i < number_of_matrices; i++) { /* Perform LU Factorization */ call mkl_dgetrfnp /* Perform Inverse Calculation */ call mkl_dgetrinp }

/* Pack from pointer-to-pointer to Compact format */ call mkl_dgepack_compact /* Perform Compact LU Factorization */ call mkl_dgetrfnp_compact /* Perform Compact Inverse Calculation */ call mkl_dgetrinp_compact /* Unpack from Compact to pointer-to-pointer format */ call mkl_dgeunpack_compact

### Applications

There are many potential applications for Compact BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) functions. Computer vision requires dense linear algebra operations on large groups of very small matrices. For example, anomaly detection in images requires simultaneously solving thousands of dense linear systems using their Cholesky factorizations. These factorizations are independent, and thus are strong candidates for speedup with Compact functions.

Partial differential equation (PDE)-based simulations use a discretization over a mesh that is represented using a block sparse matrix, where each block is a mesh entity (such as a node, edge, face, or cell center). For example, a 3D compressible fluid dynamics model using the ideal gas model uses 5×5 blocks. In a typical PDE-based application, an iterative linear solver then performs a sequence of matrix-vector and matrix-matrix products. Operations on and within the matrices can be performed in parallel. In Designing Vector-Friendly Compact BLAS and LAPACK Kernels (Kim et al., 2017), the authors demonstrate up to 6X application speedup of a linear solver for compressible fluid dynamics simulation using Compact matrix-matrix multiplication, triangular solve, and LU factorization.

### Performance Results

**Figures 7** and **8** show the performance improvement for general matrix-matrix multiplication (GEMM) and non-pivoting LU factorization of a general matrix (GETRFNP). The results are measured against calls to the generic MKL functions.

The packing and unpacking functions add additional overhead, which is mitigated by calling multiple Compact functions between the calls to pack and unpack. A typical use case is to calculate the inverse from a non-pivoting LU factorization. As shown in **Figure 9**, even when including overhead in performance the Intel MKL Compact functions still provide consistently good speedup, with some sizes and numbers of matrices demonstrating up to 4X speedup compared to calls to the generic Intel MKL functions.

### Speeding Algebra Computations

Storing data in non-standard layouts that allow for cross-matrix vectorization can provide a significant speedup in BLAS and LAPACK functions for small-sized matrices. The new Compact functions in Intel MKL 2018 support and maximize the performance benefits of the Compact data layout.

### References

1. K. Kim, T. C. (2017). “Designing Vector-Friendly Compact BLAS and LAPACK Kernels.” Proceedings of SC’17 Conference.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.