K-means Acceleration with 2nd Generation Intel® Xeon® Scalable Processors

Intel’s Hardware and Software Stack for Big Data

The amount of data humans produce every day is growing exponentially―and so is the need for high-performance, scalable algorithms to process and extract benefit from all this data. Intel puts a lot of effort into building effective data analytics tools for our platforms by optimizing not only hardware, but also software. In recent years, Intel has built a powerful data analytics and machine learning (ML) software and hardware stack that includes both optimized frameworks1 and libraries.2

From the hardware side, high performance and scalability are provided by Intel® Xeon® Scalable processors3 (the 2nd generation also contains Intel® Deep Learning Boost [Intel® DL Boost] technology], Intel® NervanaTM Neural Network Processors [Intel® NervanaTM NNP], Intel® FPGAs, and Intel® Movidius Neural Compute Stick4).

Intel-Optimized scikit-learn* and Intel® Data Analytics Acceleration Library (Intel® DAAL)

One of Intel’s optimized frameworks for classic ML algorithms, scikit-learn*5, is part of the Intel® Distribution for Python* (IDP).6 The IDP scikit-learn uses the Intel® DAAL7 library underneath to obtain high performance on Intel® architectures. Intel DAAL is an open-source8 data analytics library optimized for Intel® architectures ranging from mobile (Intel® Atom® processors) to data centers (Intel® Xeon® processors). Intel DAAL provides C++, Java*, and Python APIs, as well as newly introduced Data Parallel C++ (DPC++), which is part of oneAPI,9 a unified programming model for effective development on different architectures (CPU, GPU, and others).

K-means is one of the accelerated ML algorithms in IDP scikit-learn.

The K-means Algorithm

K-means, a popular clustering algorithm, is used in astronomy, medicine, market segmentation, and many other areas. It’s one of the accelerated ML algorithms implemented in Intel DAAL. k-means is both simple and powerful. Its objective is to split a set of N observations into K clusters. This is achieved by minimizing inertia (i.e., the sum of squared Euclidian distances from observations to the cluster centers [centroids]). The algorithm is iterative, with two steps in each iteration:

  1. For each observation, compute the distance from it to each centroid, and then reassign each observation to the cluster with the nearest centroid.
  2. For each cluster, compute the centroid as the mean of observations assigned to this cluster.


Repeat these steps until one of the following happens:

  1. The number of iterations exceeds its predefined maximum.
  2. The algorithm converges (i.e., the difference between two consecutive inertias is less than a predefined threshold).


Different initialization methods are used to get initial centroids for the first iteration. It can select random observations as initial centroids, or use more complex methods such as k-means++.10

Intel DAAL versus RAPIDS cuML* on Distributed k-means

To show how well Intel DAAL accelerates k-means, we compared it to RAPIDS cuML, which claims to be 90x faster than CPU-based implementations.11 We also compared cuML to scikit-learn from IDP, which uses Intel DAAL underneath.

From Amazon Web Services Elastic Compute Cloud*12 (AWS EC2*), we used the following instances (nodes) to measure performance:

  • Multiple (up to 8) AWS EC2 c5.24xlarge instances with 2nd Generation Intel Xeon Scalable processors
  • One AWS EC2 p3dn.24xlarge instance with eight Nvidia V100* GPUs

We chose the maximum number of c5.24xlarge instances to be eight because the cost per hour of eight c5.24xlarge instances is approximately the same as one p3dn.24xlarge instance. We used a synthetic dataset with 200 million observations, 50 columns, and 10 clusters, which is as much as the V100’s memory can store. A similar dataset with 300 million observations instead of 200 caused a RAPIDS memory error (RMM_ERROR_OUT_OF_MEMORY). CPU-based systems can typically process much larger datasets. (The code for dataset generation is shown in the Code Examples section at the end of this article.)

We chose the same initialization method in all measured cases for an apples-to-apples comparison, and we chose the float32 datatype since it gives sufficient accuracy and fits in memory.

Figure 1 and Table 1 show that starting from four Intel Xeon Scalable processor nodes, Intel DAAL outperforms cuML on eight V100 GPUs. Even one node with two Intel Xeon Scalable processors is only 40% slower than eight V100 GPUs. Also, it can hold and process the whole dataset, while fewer than eight V100s can’t. IDP scikit-learn and Intel DAAL on one CPU node show approximately the same results: the difference in training time is trivial, and due to Intel DAAL function call overhead.

Figure 1 – Speedup of k-means training with Intel DAAL

Table 1. K-means training time (in seconds) and speedup: comparison of IDP scikit-learn, Intel® DAAL, and RAPIDS cuML

The k-means algorithm converges to the same result for all measured cases. Inertia is 4×108. The IDP scikit-learn and Intel DAAL examples are available in the Code Examples section. Moreover, we calculated the k-means training cost as the cost of instances on AWS EC2 multiplied by the training time for eight Intel Xeon processor nodes and eight V100 GPUs, respectively.

k-means training cost ($)

Figure 2 shows that using Intel DAAL with eight CPU nodes results in up to 2.64x reduction in k-means training cost.

Faster and Cheaper

On AWS EC2, distributed k-means computations are 2.76x faster and 2.64x cheaper with Intel DAAL on eight Intel Xeon processor nodes than with RAPIDS cuML on eight V100s. Moreover, Intel Xeon processor-based instances can process larger datasets: data that was easily processed on one Intel Xeon processor-based node causes “out of memory” errors on eight V100s. With Intel Optane,13 memory capacity increases to 4.5 TB per socket (9 TB per two-socket instance), while an NVIDIA DGX-2 has only 512 GB of GPU memory.

Figure 2 – k-means training cost:19 RAPIDS cuML on 8 GPUs and Intel DAAL on eight CPU nodes.
AWS EC2 (N. Virginia) instance prices: c5.24xlarge (Intel DAAL) – $4.08 per hour
($32.64 per hour for eight nodes), p3dn.24xlarge (RAPIDS cuML) – $31.212 per hour.

With the oneAPI programming model,9 Intel delivers high performance, not only on the CPU, but also on its coming architectures (discrete GPU, FPGA, and other accelerators). oneAPI is delivering unified and open programming experience to developers on the architecture of their choice without compromising performance. Intel DAAL is a part of the oneAPI Beta release. You can try it in the Intel® DevCloud development sandbox.14

Intel DAAL k-means Implementation and Optimization Details

The k-means algorithm is based on computation of distances, since it’s used in cluster assignments and objective function (inertia) calculation. The distance in d-dimensional Euclidean space between an observation s = (s1, s2, …, sd) and a centroid c = (c1, c2, …, cd) is described by the following expression:

This is equal mathematically to:

Thus, the squared distance is:

To calculate squared distances, the k-means implementation in Intel DAAL splits observations into blocks of fixed size and then processes them in parallel using Threading Building Blocks.15

Component can be presented as an element mjn of matrix M = 2 S × C, where j is an index of an observation in a block, n is an index of a centroid, and S and C are matrices of a block of observations and centroids, respectively. To calculate this matrix of distance components, Intel DAAL uses matrix multiplication from Intel MKL.16

The first squared distance component, is constant and can be calculated only once for each observation, while two others (  and ) should be recalculated at each iteration.

At all k-means computation stages, Intel DAAL uses vector instructions from Intel® Advanced Vector Extensions 512 (Intel® AVX-512)17 in 2nd-generation Intel Xeon Scalable processors when computing matrix multiplication, centroid assignments, and centroid recalculation.

Using advanced software optimization techniques and enabling hardware features allows Intel DAAL to deliver high performance k-means clustering.

IDP scikit-learn and daal4py Installation

The best way to install scikit-learn from IDP or daal4py (the Python interface for Intel DAAL)18 is by creating a new conda environment:

Code Examples

Dataset generation with scikit-learn:

Running k-means with scikit-learn from IDP:

Running k-means with daal4py:

k-means Training Configuration

  • CPU configuration: c5.24xlarge AWS EC2 instances; 2nd generation Intel Xeon Scalable processors, two sockets, 24 cores per socket, HT on, Turbo on, 192 GB RAM (12 slots/16GB/2933 MHz), BIOS: 1.0 Amazon EC2* (ucode: 0x500002c), OS: Ubuntu* 18.04.2 LTS.
  • GPU configuration: p3dn.24xlarge AWS EC2 instance; Intel® Xeon® Platinum 8175M processor, two sockets, 24 cores per socket, HT on, Turbo on, 768 GB RAM, 8 Tesla V100-SXM2-32G* GPUs with 32 GB GPU memory each, BIOS 1.0 Amazon EC2* (ucode: 0x2000065), OS: Ubuntu 18.04.2 LTS.
  • Software: Python* 3.6, numpy* 1.16.4, scikit-learn 0.21.3, daal4py 2019.5, Intel DAAL 2019.5, Intel® MPI 2019.5, RAPIDS cuML 0.10, RAPIDS cuDF* 0.10, CUDA* 10.1, Nvidia* GPU driver 418.87.01, dask* 2.6.0.
  • Algorithm parameters: Single-precision (float32), number of iterations = 50, threshold = 0, random initial centroids.
Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.