1.47x Speed-Up for Popular Machine Learning Library

Yandex optimizes the performance of CatBoost with Intel® VTune™ Profiler Hotspot Analysis

Yandex is the No. 1 Internet/cloud company in Russia and a strong contributor to machine learning and artificial intelligence worldwide. Its popular CatBoost is a high-performance, open source library for gradient boosting on decision trees.

When Yandex needed to identify performance bottlenecks in CatBoost, it collaborated with Intel’s Software Development team, using Intel® VTune™ Profiler, a performance analysis tool from the Intel® oneAPI Base Toolkit, to run hotspot analysis of the CatBoost framework on several datasets. By identifying and addressing these bottlenecks, Yandex sped up CatBoost by 1.47x on Intel® platforms.

Efficient Machine Learning Models

Yandex researchers developed CatBoost for training and prediction on machine learning models. Yandex and other prominent organizations, including CERN and Cloudflare, rely on CatBoost’s features. Developers can cut the time they spend on parameter tuning by using CatBoost’s default parameters. To improve training results, CatBoost can use non-numeric factors directly, so users do not have to pre-process data or spend time and effort converting it to numbers. Users can train their models on a fast implementation of a gradient-boosting algorithm. A model applier lets users apply their trained model quickly and efficiently, even in latency-critical tasks.

To maximize the value of CatBoost, Yandex needed to ensure that its performance on CPU, whether bare metal or cloud, is optimal. To that end, it used Intel VTune Profiler, an Intel® software development tool.

Maximizing CatBoost’s Performance

Yandex evaluated CatBoost’s performance on several open source datasets on Intel® CPU platforms, including Intel® Xeon® and Xeon® Scalable processors (Figures 1 and 2).

Intel VTune Profiler analyzed the code, collecting key profiling data and presenting its findings through an interface that simplifies interpretation and helps developers focus on the most effective software optimizations, from computation and threading to memory and storage.

Yandex measured training time on the datasets listed on the left of Figure 1 and Figure 2 and demonstrated the speed-up achieved with the optimizations suggested by Intel VTune Profiler.

Intel VTune Profiler’s hotspot analysis revealed false sharing and unnecessary atomic operations that were compromising memory access efficiency. By identifying and addressing these bottlenecks, Yandex sped up CatBoost by 1.47x.
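To make the two issues concrete, the following is a minimal C++ sketch of the general pattern, not CatBoost's actual code or the change Yandex made. All names (`CountWithSharedAtomic`, `CountWithPaddedSlots`, `PaddedCounter`, `kCacheLine`) are hypothetical, the 64-byte cache-line size is an assumption that holds for typical x86 CPUs, and C++17 is assumed so that the over-aligned vector elements are allocated with the requested alignment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size in bytes (typical for x86; an assumption, not a measured value).
constexpr std::size_t kCacheLine = 64;

// Pattern to avoid: every thread updates one shared atomic on every iteration,
// so the cache line holding `total` is contended by all cores.
std::uint64_t CountWithSharedAtomic(std::size_t num_threads,
                                    std::uint64_t iters_per_thread) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// One counter per thread, aligned and padded to its own cache line so that
// neighboring counters never share a line (which would cause false sharing).
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one assumed cache line");

// Alternative pattern: each thread writes only to its own padded slot with
// plain (non-atomic) updates, and the slots are summed once after joining.
std::uint64_t CountWithPaddedSlots(std::size_t num_threads,
                                   std::uint64_t iters_per_thread) {
    assert(num_threads > 0);
    std::vector<PaddedCounter> slots(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&slots, t, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                slots[t].value += 1;  // only thread t touches slot t
            }
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (const auto& s : slots) total += s.value;  // single-threaded reduction
    return total;
}
```

Both functions return the same count. The second avoids per-iteration atomic traffic and keeps each thread's data on a separate cache line, which is the kind of change a hotspot analysis can point toward when it flags memory access inefficiency.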

Finding Bottlenecks and Boosting Performance

This joint effort of the Intel and Yandex teams is helping data scientists train more complex models and work through datasets faster on Intel platforms, and is raising the popularity of the CatBoost machine learning library among the developer community. CatBoost’s performance gains will help data scientists around the world use their compute resources more efficiently and save on cloud resources.

Intel® Software tools proved effective for Yandex software developers and bring value to data scientists worldwide.


Figure 1. Intel Xeon processor 6230 used for training, 40 physical cores with 1 thread per physical core

 

Figure 2. Intel Xeon processor E5-2660 v4 with 2 sockets, 14 cores per socket, 2 HT per core, 1 thread per physical core

 
