One-Line Code Changes to Boost Pandas, Scikit-Learn, and TensorFlow Performance
Data scientists tackle a wide array of everyday problems, from healthcare to finance to Netflix show ratings, using data-driven decision-making. They rely on tools such as the popular pandas, Scikit-Learn, and TensorFlow frameworks for data preprocessing, classical machine learning (ML), and deep learning (DL) to build accurate models and visualizations. Because most analytics workflows are data- and compute-intensive, how can time-to-solution be reduced without burdening data scientists with extensive code modifications? This article illustrates how Intel’s latest AI software optimizations substantially improve data analytics performance on Intel® processor-based platforms with minimal code changes.
Intel® Distribution of Modin*
Let’s start with pandas, the popular Python data preprocessing and analysis library beloved by data scientists for its ease of use. Pandas executes most operations on a single core, even though modern processors offer many cores. As data size grows, preprocessing time can stretch from minutes to hours or even days. This lack of scaling leads many data scientists to abandon pandas for another framework such as Apache Spark. Spark, however, is less user-friendly than pandas and usually requires data scientists to modify their workflows.
Intel offers a better path to seamless scaling. Modin is a drop-in replacement that supports the pandas API, and Intel® Distribution of Modin*, with its OmniSci backend, scales pandas workloads with a one-line code change:
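The one-line change can be sketched as follows. This is a minimal illustration, not Intel's benchmark code: the `trips` table and its column names are hypothetical stand-ins for taxi trip records, the `try`/`except` fallback to plain pandas is added only so the sketch runs where Modin is not installed, and backend selection (such as OmniSci) is not shown.

```python
# The one-line change: import Modin's pandas-compatible module instead of pandas.
# Before:  import pandas as pd
try:
    import modin.pandas as pd
except ImportError:
    # Fallback (an assumption for this sketch) so it still runs without Modin.
    import pandas as pd

# Hypothetical miniature stand-in for taxi trip records.
trips = pd.DataFrame(
    {
        "passenger_count": [1, 2, 1, 2],
        "fare_amount": [10.0, 20.0, 30.0, 40.0],
    }
)

# Everything after the import is ordinary pandas code and needs no changes.
mean_fare = trips.groupby("passenger_count")["fare_amount"].mean()
print(mean_fare)
```

Because only the import line differs, an existing pandas script can usually be tried with Modin without touching the rest of the workflow.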
The performance improvement on the common NYCTaxi dataset (approximately 1.1 billion individual taxi trips in New York City) is significant.
See Data Science at Scale with Modin for more details.
Intel® Extension for Scikit-Learn*
After preprocessing, the next step in the data science pipeline is often data modeling with the popular Scikit-Learn ML library. Like pandas, stock Scikit-Learn does not take full advantage of instruction- or thread-level parallelism. Intel® Extension for Scikit-Learn* can significantly speed up ML performance (38x on average and up to 200x, depending on the algorithm) with only two changed lines of code:
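A minimal sketch of the two-line change, assuming the extension is installed as the `sklearnex` Python package: `patch_sklearn()` is called before any Scikit-Learn estimators are imported. The toy data and the `try`/`except` fallback are assumptions added for illustration, so the sketch still runs (unaccelerated) where the extension is absent.

```python
# The two added lines: import the patcher and apply it
# BEFORE importing anything from sklearn.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    # Fallback (an assumption for this sketch): continue with stock Scikit-Learn.
    pass

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# The modeling code itself is unchanged Scikit-Learn.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The estimator code keeps the standard Scikit-Learn API, so the patch can be removed again by deleting the same two lines.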
See Intel Gives Scikit-Learn the Performance Boost Data Scientists Need and the Intel Extension for Scikit-Learn documentation for more details.
Intel® Optimization for TensorFlow*
TensorFlow is a very popular framework best known for DL model development and deployment. Previously, however, the official TensorFlow package was not optimized for Intel processors. As of TensorFlow v2.5, Intel® oneAPI Deep Neural Network Library (oneDNN) is included in the official TensorFlow package. Its built-in optimizations are easily enabled without code modifications. Just set an environment variable to get up to a 3x speedup:
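The variable in question is `TF_ENABLE_ONEDNN_OPTS`. It is normally exported in the shell before launching Python; the sketch below sets it from inside a script instead, which should work only if it happens before TensorFlow is first imported. The `try`/`except` around the import is an assumption added so the sketch runs on machines without TensorFlow, and the default value of the flag may differ in TensorFlow releases other than 2.5.

```python
import os

# Enable the oneDNN optimizations. This must be set before TensorFlow is imported,
# because the flag is read when the library is loaded.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

try:
    import tensorflow as tf
    tf_version = tf.__version__
except ImportError:
    # Fallback (an assumption for this sketch): TensorFlow is not installed here.
    tf_version = None

print("TF_ENABLE_ONEDNN_OPTS =", os.environ["TF_ENABLE_ONEDNN_OPTS"])
print("TensorFlow version:", tf_version)
```

Model-building and training code stays exactly as it was; only the environment changes.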
See Leverage Intel Deep Learning Optimizations in TensorFlow for more information about the oneDNN optimizations that are available in TensorFlow 2.5.
If you are working on a complex pipeline that requires one or more of these libraries, take a look at Intel® oneAPI AI Analytics Toolkit, which provides a comprehensive set of interoperable Python packages, including those mentioned in this article.
For more information on how to install these tools and take advantage of these optimizations, check out the following resources:
- Intel Distribution of Modin Resource Page
- Modin Documentation
- Intel Extension for Scikit-Learn Getting Started Guide
- Intel Optimization for TensorFlow Getting Started Guide
- TensorFlow GitHub
- Intel oneAPI AI Analytics Toolkit Download
- Intel oneAPI AI Analytics Toolkit Getting Started Guide
- oneAPI Basics Training Series