Superior Machine Learning Performance on the Latest Intel® Xeon® Scalable Processors

Intel Gives Data Scientists the Performance and Ease-of-Use They Need

The newest 3rd generation Intel® Xeon® Scalable processors enhance artificial intelligence (AI), cloud computing, security, and many other areas. Intel has optimized an array of software tools, libraries, and frameworks so that applications can easily take advantage of the latest hardware advances. The results are impressive. This blog focuses on the popular scikit-learn machine learning (ML) library and Intel® Extension for Scikit-learn*.

We previously demonstrated the performance leadership of 2nd generation Intel® Xeon® Scalable processors over NVIDIA and AMD processors, achieved by adding just two lines of code.

Here I’ll show that Intel Extension for Scikit-learn on the latest Intel Xeon Scalable processors delivers a 1.09x to 1.63x speedup over the previous generation, a 0.65x to 7.23x speedup compared to NVIDIA A100, and a 0.61x to 2.63x speedup compared to AMD Milan.

Intel® Extension for Scikit-learn*

Intel Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock scikit-learn package. You can take advantage of its performance optimizations by adding just two lines of code before the usual scikit-learn imports:

from sklearnex import patch_sklearn
patch_sklearn()

# the start of the user’s code
from sklearn.cluster import DBSCAN
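
As a slightly fuller illustration, the sketch below wraps the two patching lines in a guard so the same script still runs where the extension is not installed. The helper names try_patch_sklearn and cluster_demo, the synthetic data, and the DBSCAN parameters are my own choices for illustration and are not part of the library.

```python
import importlib.util


def try_patch_sklearn() -> bool:
    """Apply the Intel Extension for Scikit-learn patch if it is installed.

    Returns True when the patch was applied, False when sklearnex is missing.
    """
    if importlib.util.find_spec("sklearnex") is None:
        return False  # fall back to stock scikit-learn
    from sklearnex import patch_sklearn

    patch_sklearn()
    return True


def cluster_demo(n_samples: int = 1000) -> int:
    """Run DBSCAN on synthetic data and return the number of clusters found."""
    import numpy as np
    from sklearn.cluster import DBSCAN  # imported after patching on purpose

    rng = np.random.default_rng(0)  # synthetic data, for illustration only
    X = rng.normal(size=(n_samples, 3))
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    return len(set(labels.tolist()) - {-1})  # label -1 marks noise points
```

Call try_patch_sklearn() first and cluster_demo() second: the patch has to be in place before the scikit-learn estimators are imported.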

Intel Extension for Scikit-learn is part of Intel® oneAPI AI Analytics Toolkit (AI Kit), which provides a consolidated package of Intel’s latest deep learning and ML optimizations. You can download it from several distribution channels: Docker Container, YUM, APT, and Anaconda. Alternatively, you can install just the Intel Extension for Scikit-learn component from PyPI or Conda Forge:

pip install scikit-learn-intelex

conda install scikit-learn-intelex -c conda-forge
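
After installing, it can be handy to confirm from Python that the package is actually present in the active environment. The following is a minimal sketch using only the standard library; the function name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package: str = "scikit-learn-intelex"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("scikit-learn-intelex is not installed in this environment")
    else:
        print("scikit-learn-intelex version:", version)
```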

Intel Extension for Scikit-learn uses the Intel® oneAPI Data Analytics Library (oneDAL) to achieve its acceleration. The library enables all the latest vector instructions, such as Intel® Advanced Vector Extensions (Intel® AVX-512). It also uses cache-friendly data blocking, fast BLAS operations with the Intel® oneAPI Math Kernel Library (oneMKL), and scalable multithreading with the Intel® oneAPI Threading Building Blocks (oneTBB).
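
To give a feel for what cache-friendly data blocking means, here is a toy sketch of a blocked matrix multiplication. It is not oneDAL's implementation, and pure Python will not show the speed benefit; it only illustrates the loop structure, in which small tiles of the inputs are reused while they are still likely to sit in cache. The function names and the block size are assumptions made for this example.

```python
def matmul_naive(a, b):
    """Reference triple-loop product of two matrices given as lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile so each tile is reused while 'hot'."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile over rows of a
        for p0 in range(0, k, block):      # tile over the shared dimension
            for j0 in range(0, m, block):  # tile over columns of b
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

Both functions compute the same result; only the order in which the elements are visited differs. Optimized libraries pair that ordering with vectorized kernels.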

Performance Leadership

I compared the performance of several ML algorithms in Intel Extension for Scikit-learn on 2nd and 3rd Gen Intel Xeon Scalable processors and observed 1.09x to 1.63x speedups in training and inference (Figure 1).

Figure 1. Performance improvement of 3rd Gen Intel Xeon Scalable processors over 2nd Gen
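
A speedup figure of this kind is usually the ratio of two runtimes. The sketch below assumes that definition (baseline time divided by candidate time) and takes the median of several runs to reduce noise. The helper names and the choice of median are mine and do not describe the exact benchmarking harness used here.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Call `func` several times and return the median wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def speedup(baseline_seconds, candidate_seconds):
    """Speedup of the candidate over the baseline.

    Values above 1 mean the candidate is faster; values below 1 mean it is slower.
    """
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / candidate_seconds
```

For example, speedup(2.0, 1.0) gives 2.0, and speedup(1.0, 2.0) gives 0.5, meaning the candidate took twice as long as the baseline.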

To assess competitive performance, I compared 3rd Generation Intel Xeon Scalable processors to the latest NVIDIA A100 and AMD Milan processors. The new Intel Xeon Scalable processors demonstrated performance leadership across a variety of ML algorithms, with speedups ranging from 0.65x to 7.23x compared to NVIDIA A100 (Figure 2) and from 0.61x to 2.63x compared to AMD Milan (Figure 3). A speedup below 1x marks a workload where the other processor was faster.

Figure 2. Speedup of 3rd Gen Intel Xeon Scalable processors (using Intel Extension for Scikit-learn) over NVIDIA A100 (using RAPIDS cuML)

Figure 3. Speedup of new 3rd Gen Intel Xeon Scalable processors over AMD Milan using Intel Extension for Scikit-learn on both processors

Intel’s Most Advanced Data Center Processor

The 3rd Generation Intel Xeon Scalable processors feature a flexible architecture with built-in AI acceleration via Intel® Deep Learning Boost technology plus a host of other enhancements:

  • Faster memory. The number of memory channels per socket increased from six to eight, and the maximum memory frequency increased from 2933MHz to 3200MHz. As a result, peak DRAM bandwidth increased by up to 1.45x. Data analytics workloads are often DRAM-bound because many operations must be performed in-memory, so 3rd Generation Intel Xeon Scalable processors offer a significant improvement for these workloads.
  • More cores. Top-bin 3rd Generation Intel Xeon Scalable processors have 40 cores per socket, providing more parallelism for multithreaded data processing.
  • Advanced microarchitecture. The core can now process up to five instructions per cycle instead of four, and it has ten execution ports instead of eight. In addition, new instructions such as AVX512 BITALG and AVX512 VBMI2 were introduced to improve single-core performance.
  • Larger caches. The Intel Xeon Platinum 8380 processor provides 60MB of last-level cache (LLC): 56% more than the Intel Xeon Platinum 8280L (38.5MB). L2 cache increased from 1MB to 1.25MB per core, and L1 cache increased from 32KB to 48KB per core. Some ML algorithms spend most of their time processing data residing in caches, so caching improvements can have a significant impact on performance.
  • New level of security. ML algorithms often process confidential data, so new Intel Xeon Scalable processors provide hardware-based memory encryption with granular control via Intel® Software Guard Extensions (Intel® SGX).
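
The ratios quoted in the list above can be checked with a few lines of arithmetic. This sketch assumes that peak DRAM bandwidth scales with the number of channels multiplied by the memory frequency; the variable names are mine.

```python
# Memory: six channels at 2933MHz versus eight channels at 3200MHz.
channels_old, channels_new = 6, 8
freq_old, freq_new = 2933, 3200
bandwidth_ratio = (channels_new / channels_old) * (freq_new / freq_old)

# Caches: last-level cache in MB, L2 in MB per core, L1 in KB per core.
llc_increase = 60 / 38.5 - 1
l2_increase = 1.25 / 1.0 - 1
l1_increase = 48 / 32 - 1

print(f"Peak DRAM bandwidth ratio: {bandwidth_ratio:.2f}x")  # 1.45x
print(f"LLC increase: {llc_increase:.0%}")                   # 56%
print(f"L2 increase per core: {l2_increase:.0%}")            # 25%
print(f"L1 increase per core: {l1_increase:.0%}")            # 50%
```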

The optimizations in Intel Extension for Scikit-learn plus the advanced capabilities of 3rd Generation Intel Xeon Scalable processors deliver superior performance for ML and data analytics workloads. This allows you to run enterprise applications on a single architecture, optimizing total cost of ownership for mixed workloads and bringing innovative solutions to market faster.

Hardware and Software Benchmark Configurations

All configurations were tested by Intel.

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.