Building Fast Data Compression Code for Cloud and Edge Applications

How to Optimize Your Compression with Intel® Integrated Performance Primitives (Intel® IPP)

Finding efficient ways to compress and decompress data is more important than ever. Compressed data takes up less space and requires less time and network bandwidth to transfer. In cloud service code, efficient compression cuts storage costs. In mobile and client applications, high-performance compression can improve communication efficiency―providing a better user experience.

However, compression and decompression consume processor resources. And, especially in data-intensive applications, this can negatively affect the overall system performance. So an optimized implementation of compression algorithms plays a crucial role in minimizing the system performance impact. Intel® Integrated Performance Primitives (Intel® IPP) is a library that contains highly optimized functions for various domains, including lossless data compression. In this article, we’ll discuss these functions and the latest improvements. We will examine their performance and explain how these functions are optimized to achieve the best performance. We will also explain how applications can decide which compression algorithm to use based on workload characteristics.

Intel® IPP Data Compression Functions

The Intel IPP Data Compression Domain provides optimized implementations of the common data compression algorithms, including the BZIP2*, ZLIB*, LZO*, and a new LZ4* function, which will be available in the Intel IPP 2018 update 1 release. These implementations provide “drop-in” replacements for the original compression code. Moving the original data compression code to an Intel IPP-optimized code is easy.

These data compression algorithms are the basic functions in many applications, so a variety of applications can benefit from Intel IPP. A typical example is Intel IPP ZLIB compression. ZLIB is the fundamental compression method for various file archivers (e.g., gzip*, WinZip*, and PKZIP*), portable network graphics (PNG) libraries, network protocols, and some Java* compression classes. Figure 1 shows how these applications can adopt the optimized ZLIB in Intel IPP. Intel IPP provides fully compatible APIs. The application simply needs to relink with the Intel IPP ZLIB library. If the application uses ZLIB as a dynamic library, it can be easily switched to a new dynamic ZLIB library built with Intel IPP. In the latter case, relinking the application is not required.

Figure 1 – Intel® IPP provides drop-in optimization for ZLIB* functions

In recent releases, the Intel IPP data compression domains extended several functionalities and increased compression performance:

  • Support for LZ4 data compression will be included in Intel IPP 2018 beta update 1, available soon.
  • Increased LZO compression performance by 20 to 50 percent. Added level 999 support in LZO decompression.
  • A new fastest ZLIB data compression mode, which can improve performance by up to 2.3x compared to ZLIB level 1 performance.
  • Trained Huffman tables for ZLIB compression let users improve compression ratio in the fastest mode.

Intel IPP Data Compression Optimization

How does Intel IPP data compression perform the optimization and achieve top performance? Intel IPP uses several different methods to optimize data compression functionality. Data compression algorithms are very challenging for optimization on modern platforms due to strong data dependency (i.e., the behavior of the algorithm depends on the input data). Basically, it means that only a few new CPU instructions can be used to speed up these algorithms. The single instruction, multiple data (SIMD) instructions can be used here in limited cases, mostly pattern searching.

Performance optimization of data compression in Intel IPP takes place in different ways and at different levels:

  • Algorithmic optimization
  • Data optimization
  • New CPU instructions
  • Intel IPP data compression performance

Algorithmic Optimization

At the highest level, algorithmic optimization provides the greatest benefit. The data compression algorithms are implemented from scratch in Intel IPP to process input data and to generate output in the most effective way. For example, selecting proper starting conditions for input data processing can bring a performance gain from the very beginning. Let’s look at the operation of searching for patterns in input data. If you set a minimum match length of two to three bytes, you will detect more matches―and obtain a better compression ratio. On the other hand, you will drastically increase input data processing time. So, the minimum match length should balance compression ratio and compression performance. This is mostly the result of statistical experiments and then careful planning of condition checks and branches in the code. In all cases, it is more effective to put checks for the most probable situations at the beginning of your code so that you will not have to spend time on conditional operations with a low probability of occurring.

Data Optimization

Careful planning of internal data layout and size also brings performance benefits. Proper alignment of data makes it possible to save CPU clocks on data reads and writes. In most CPUs, reading/writing executes faster on aligned data than on nonaligned data. This is especially true when memory operations are executed on data elements longer than 16 bits. Another factor affecting optimization is data size. If we keep internal data structures―arrays, tables, and lists― small, we can often reuse the CPU data cache, increasing overall performance.

New CPU Instructions

The Intel® Streaming SIMD Extensions 4.2 (Intel® SSE 4.2) architecture introduced several CPU instructions, which can be used in data compression algorithms according to their nature. More opportunities came with the Intel® Advanced Vector Extensions 2 instruction set. First, the CRC32 instruction can be used to compute the checksum of input data for data integrity. It is a key part of many data compression standards. The CRC32 instruction can also be used to implement a hash function, another key part of compression algorithms. Also, bit manipulation instructions (BMI) are used to speed up pattern matching functions. Finally, reading/writing data by 32-, 64-, 128-, or 256-bit portions saves memory bus cycles, which also improves performance.

Intel IPP Data Compression Performance

Each Intel IPP function includes multiple code paths, with each path optimized for specific generations of Intel® and compatible processors. Also, new optimized code paths are added before future processors are released, so developers can just link to the newest version of Intel IPP to take advantage of the latest processor architectures.

Figure 2 shows performance results with the Intel IPP ZLIB compression/decompression library. Compared with the open-source ZLIB code, the Intel IPP ZLIB implementation can achieve a 1.3x to 3.3x performance improvement at different compression levels. At the same time, Intel IPP provides fully compatible APIs and compression results.

Configuration info – Software versions: Intel® Integrated Performance Primitives (Intel® IPP) 2017, Intel® C++ Compiler 16.0. Hardware: Intel® Core™ processor i7-6700K, 8 MB cache, 4.2 GHz, 16 GB RAM, Windows Server* 2012 R2. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors.

Figure 2 – Performance results with the Intel® IPP ZLIB* compression and decompression library

Choosing the Compression Algorithms for Applications

Since data compression is a trade-off between compression performance and compression ratio, there is no “best” compression algorithm for all applications. For example, the BZIP2 algorithm achieves good compression efficiencies but because it is more complex, it requires significantly more CPU time for both compression and decompression.

Figure 3 is a summary of the common Intel IPP functions on compression performance and ratio. Among these algorithms, the LZ4 is the fastest compression method―about 20x faster than BZIP2 on the standard corps compression data. But BZIPs could achieve the best compression ratios. ZLIB provides a balance between performance and efficiency, so the algorithm is chosen by many file archivers, compressed network protocols, file systems, and lossless image formats.

Figure 3 – Common Intel® IPP functions and compression ratio

How can a user’s application decide the compression algorithm? Often, this involves a number of factors:

  • Compression performance to meet the application’s requirement
  • Compression ratio
  • Compression latency, which is especially important in some real-time applications
  • Memory footprint to meet some embedded target application requirements
  • Conformance with standard compression methods, enabling the data to be decompressed by any standard archive utility
  • Other factors as required by the application

Deciding on the compression algorithms for specific applications requires balancing all of these factors, and also considering the workload characteristics and requirements. There is no “best” compression method for all scenarios.

Consider a typical data compression usage example. Suppose we have a file download scenario that requires transferring data from servers to a client application. To maximize performance, lossless data compression is performed on the servers before transmission to the client. In this situation, the client application needs to download the data over the network and then decompress it. Because overall performance depends on both transmission and decompression performance, we need to carefully consider workload and network characteristics. Some highcompression algorithms, like BZIP2, could reduce data transmission. On the other hand, BZIP2 decompression takes longer to compute than other, less sophisticated methods.

Figure 4 shows the overall performance of the common Intel IPP compression algorithms at different network speeds. Tests were run on the Intel® Core™ i5-4300U processor. When the code runs on some lower-bandwidth networks (e.g., 1 Mbps), an algorithm that can achieve high compression ratios is important to reduce the data/time for transmission. In our test scenario, Intel IPP BZIP2 achieved the best performance (approximately 2.7x speedup) compared to noncompressed transmission. However, when the data is transferred by a higher-performance network (e.g., 512 Mbps), LZ4 had the best overall performance, since the algorithm is better balanced for compression and transmission time in this scenario.

Configuration info – Software versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 Beta Update 1, Intel® C++ Compiler 16.0. Hardware: Intel® Core™ processor i7-6700K, 8 MB cache, 4.2 GHz, 16 GB RAM, Windows Server* 2012 R2. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors.

Figure 4 – Data transfer with and without compression

Optimized Data Compression

Data storage and management are becoming high priorities for today’s data centers and edgeconnected devices. Intel IPP offers highly optimized, easy-to-use functions for data compression. New CPU instructions and an improved implementation algorithm enable the data compression algorithms to provide significant performance gains for a wide range of applications and architectures. Intel IPP also offers common lossless compression algorithms to serve the needs of different applications.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.