Effectively Train and Execute Machine Learning and Deep Learning Projects on CPUs

Meet the Intel-Optimized Frameworks that Make It Easier

When you’re developing AI applications, you need highly optimized deep learning models that enable an app to run wherever it’s needed and on any kind of device—from the edge to the cloud. But optimizing deep learning models for higher performance on CPUs presents a number of challenges, like:

  • Code refactoring to take advantage of modern vector instructions
  • Use of all available cores
  • Cache blocking
  • Balanced use of prefetching
  • And more

These challenges aren’t significantly different from those you see when you’re optimizing other performance-sensitive applications—and developers and data scientists can find a wealth of deep learning frameworks to help address them. Intel has developed a number of optimized deep learning primitives that you can use inside these popular deep learning frameworks to ensure you’re implementing common building blocks efficiently through libraries like Intel® Math Kernel Library (Intel® MKL).

In this article, we’ll look at the performance of Intel’s optimizations for frameworks like Caffe*, TensorFlow*, and MXNet*. We’ll also introduce the type of accelerations available on these frameworks via the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and show you how to acquire and/or build these framework packages with Intel’s accelerations―so you can take advantage of accelerated CPU training and inference execution with no code changes (Figures 1 and 2).

Figure 1 – Deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable processors.

Figure 2 – Boost your deep learning performance on Intel Xeon Scalable processors with Intel® Optimized TensorFlow and Intel MKL-DNN.

Intel® Math Kernel Library for DNN

Intel MKL-DNN is an open-source performance library that accelerates deep learning applications and frameworks on Intel® architectures. Intel MKL-DNN contains vectorized and threaded building blocks that you can use to implement deep neural networks (DNN) with C and C++ interfaces (Table 1).

The performance benefit from Intel MKL-DNN primitives is tied directly to the level of integration to which the framework developers commit (Figure 3). There are reorder penalties for converting input data into Intel MKL-DNN preferred formats, so framework developers benefit from converting once and then staying in Intel MKL-DNN format for as much of the computation as possible.

Also, 2-in-1 and 3-in-1 fused versions of layer primitives are available if a framework developer wants to fully leverage the power of the library. The fused layers allow for Intel MKL-DNN math to run concurrently on downstream layers if the relevant upstream computations are completed for that piece of the data/image frame. A fused primitive will include compute-intensive operations along with bandwidth-limited ops.

Table 1. What’s included in Intel MKL-DNN

Function                       Features
Compute-intensive operations   • 1D, 2D, and 3D spatial convolution and deconvolution
                               • Inner product
                               • General-purpose matrix-matrix multiplication
                               • Recurrent neural network (RNN) cells
Memory-bound operations        • Pooling
                               • Batch normalization
                               • Local response normalization
                               • Activation functions
                               • Sum
Data manipulation              • Reorders/quantization
                               • Concatenation
Primitive fusion               • Convolution with sum and activations
Data types                     • fp32
                               • int8

Figure 3 – Performance versus level of integration and Intel MKL-DNN data format visualization

Installing Intel MKL-DNN

Intel MKL-DNN is distributed in source code form under the Apache* License Version 2.0. See the Readme for up-to-date build instructions for Linux*, macOS*, and Windows*.

The VTUNEROOT flag is required for integration with Intel® VTune™ Amplifier. The Readme explains how to use this flag.

Installing Intel-Optimized Frameworks

Intel® Optimization for TensorFlow*

Current distribution channels are PIP, Anaconda, Docker, and build from source. See the Intel® Optimization for TensorFlow* Installation Guide for detailed instructions for all channels.

Anaconda – Linux:

  conda install -c defaults tensorflow

Anaconda – Windows:

  conda install tensorflow-mkl -c defaults

Intel® Optimization for Caffe*

Intel has a tutorial describing how to use Intel® Optimization for Caffe* to build Caffe optimized for Intel architecture, train deep network models using one or more compute nodes, and deploy networks.

  (Ubuntu 16.04)
  # Open a terminal window and install the build dependencies
  sudo apt-get update
  sudo apt-get install build-essential cmake git pkg-config
  sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev \
      libhdf5-serial-dev protobuf-compiler
  sudo apt-get install libatlas-base-dev
  sudo apt-get install --no-install-recommends libboost-all-dev
  sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev
  sudo apt-get install libopencv-dev
  # Clone the source and go to the Caffe root directory
  git clone https://github.com/intel/caffe.git
  cd caffe
  cp Makefile.config.example Makefile.config
  # Edit Makefile.config (vi Makefile.config) so that these two lines read:
  INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
  LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial
  make all -j4

Intel® Optimization for MXNet*

Intel has a tutorial explaining Intel® Optimization for Apache* MXNet.

  git clone --recursive https://github.com/apache/incubator-mxnet.git
  cd incubator-mxnet && make -j $(nproc) USE_OPENCV=1 USE_BLAS=mkl USE_MKLDNN=1

Performance Considerations and Runtime Settings

Now let’s consider TensorFlow runtime settings for best performance―specifically, convolutional neural network (CNN) inference. The concepts can be applied to other frameworks accelerated with Intel MKL-DNN and other use cases. However, some empirical testing will be required. Where necessary, we’ll give different recommendations for real-time inference (RTI) with batch size of 1 and maximum throughput (MxT) with tunable batch size.

Maximum Throughput versus Real-Time Inference

Deep learning inference is usually done with two different strategies, each with different performance measurements and recommendations:

  • Max Throughput (MxT) looks to process as many images per second as possible, passing in batches of size > 1. We can achieve the best performance by exercising all the physical cores on a socket. This solution is intuitive in that we simply load up the CPU with as much work as we can, and process as many images as we can, in a parallel and vectorized fashion.
  • Real-time Inference (RTI) is an altogether different scenario where we want to process a single image as quickly as possible. Here, we aim to avoid penalties from excessive thread launching and orchestration between concurrent processes. The strategy is to confine and execute quickly.
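The two strategies are judged by different numbers: images per second for MxT and time per call for RTI. As a rough sketch of how you might measure both, the snippet below times a placeholder workload. The names fake_infer and measure are hypothetical, and the list-summing "model" is only a stand-in for a real inference call.

```python
import time


def fake_infer(batch):
    # Hypothetical stand-in for a model's inference call: one result per image.
    return [sum(image) for image in batch]


def measure(infer, batch_size, num_batches=20, image_len=1000):
    """Time num_batches calls and report throughput and per-call latency."""
    batch = [[1.0] * image_len for _ in range(batch_size)]
    start = time.perf_counter()
    for _ in range(num_batches):
        infer(batch)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {
        "images_per_sec": batch_size * num_batches / elapsed,
        "latency_ms": 1000.0 * elapsed / num_batches,
    }


rti = measure(fake_infer, batch_size=1)    # RTI: watch latency_ms
mxt = measure(fake_infer, batch_size=64)   # MxT: watch images_per_sec
print("RTI latency (ms):", rti["latency_ms"])
print("MxT throughput (images/sec):", mxt["images_per_sec"])
```

With a real model, you would replace fake_infer with the session or model call and sweep batch_size for the MxT case.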

Let’s discuss some best-known methods (BKMs) for maximizing MxT and RTI performance.

TensorFlow Runtime Options Affecting Performance

These runtime options heavily affect TensorFlow performance. Understanding them will help you get the best performance out of Intel’s optimizations. BKMs differ for MxT and RTI. The runtime options are: {intra|inter}_op_parallelism_threads and data layout.

{intra|inter}_op_parallelism_threads

  • Recommended settings (MxT): intra_op_parallelism = #physical cores
  • Recommended settings (RTI): intra_op_parallelism = #physical cores
  • Recommended settings for inter_op_parallelism: 2
  • Usage (shell):
    python script.py --num_intra_threads=cores --num_inter_threads=2 --mkl=True

intra_op_parallelism_threads and inter_op_parallelism_threads are runtime options defined in tensorflow.ConfigProto. The ConfigProto is used for configuration when creating a session. These two options control the number of cores to use.

The intra_op_parallelism_threads option controls parallelism inside an operation. For instance, if matrix multiplication or reduction is intended to be executed in several threads, this option should be set. TensorFlow will schedule tasks in a thread pool that contains intra_op_parallelism_threads threads. OpenMP threads are bound to thread contexts as closely as possible on different cores. Setting this option to the number of available physical cores is recommended.

The inter_op_parallelism_threads option controls parallelism among independent operations. Since these operations do not depend on each other, TensorFlow will try to run them concurrently in a thread pool that contains inter_op_parallelism_threads threads. To minimize the impact on the intra_op_parallelism_threads threads, we recommend setting this option to the number of sockets where you want the code to run. For the Intel Optimization for TensorFlow, we recommend keeping the entire execution on a single socket.
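As a minimal sketch of how these two options could be set from Python, the snippet below builds the recommended values and passes them to a ConfigProto. The helper names are hypothetical, the physical core count is something you supply yourself, and the use of tf.compat.v1 assumes a TensorFlow release that provides that module.

```python
def recommended_parallelism(physical_cores, inter_op_threads=2):
    """Return the thread-pool settings recommended above as a plain dict."""
    if physical_cores < 1 or inter_op_threads < 1:
        raise ValueError("thread counts must be positive")
    return {
        "intra_op_parallelism_threads": physical_cores,
        "inter_op_parallelism_threads": inter_op_threads,
    }


def make_session_config(physical_cores, inter_op_threads=2):
    """Build a ConfigProto from the settings (imports TensorFlow lazily)."""
    import tensorflow as tf  # assumption: a release that ships tf.compat.v1

    settings = recommended_parallelism(physical_cores, inter_op_threads)
    return tf.compat.v1.ConfigProto(**settings)


# Example for a 28-core socket (illustrative core count):
print(recommended_parallelism(28))
# With TensorFlow installed, the config would be passed when creating a session:
#   sess = tf.compat.v1.Session(config=make_session_config(28))
```

The default of 2 for inter_op_threads follows the bulleted recommendation. Pass the socket count instead if you prefer the per-socket guidance.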

Data Layout

  • Recommended settings: data_format = NCHW
  • Usage (shell):
    python script.py --num_intra_threads=cores --num_inter_threads=2 --mkl=True \
      --data_format=NCHW

In modern Intel architectures, efficient use of cache and memory greatly impacts overall performance. A good memory access pattern minimizes the performance cost of accessing data in memory. To achieve this, it’s important to consider how data is stored and accessed. This is usually referred to as data layout. It describes how multidimensional arrays are stored linearly in the memory address space.

In most cases, data layout is represented by four letters for a two-dimensional image.

  • N: Batch size, indicating number of images in a batch
  • C: Channel, indicating number of channels in an image
  • W: Width, indicating number of pixels in horizontal dimension of an image
  • H: Height, indicating number of pixels in vertical dimension of an image

Figure 4 – Data format/layout: NHWC versus NCHW.

The order of these four letters indicates how pixel data are stored in one-dimensional memory space. For instance, NCHW indicates that pixel data are stored width-wise first, then height-wise, then channel-wise, and finally batch-wise (Figure 4). The data is then accessed from left to right with channels-first indexing. NCHW is the recommended data layout for Intel MKL-DNN because this is an efficient layout for the CPU. TensorFlow uses NHWC as the default data layout, but it also supports NCHW.
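To make the ordering concrete, here is a small sketch that computes where one pixel value lands in linear memory under each layout. The function names and the tensor sizes are illustrative assumptions, not part of any framework API.

```python
def offset_nchw(n, c, h, w, dims):
    """Linear offset of element (n, c, h, w) when W varies fastest."""
    _, C, H, W = dims
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, dims):
    """Linear offset of the same element when C varies fastest."""
    _, C, H, W = dims
    return ((n * H + h) * W + w) * C + c


dims = (2, 3, 4, 5)  # N, C, H, W (illustrative sizes)

# Distance in memory between two horizontally adjacent pixels of one channel.
step_nchw = offset_nchw(0, 0, 0, 1, dims) - offset_nchw(0, 0, 0, 0, dims)
step_nhwc = offset_nhwc(0, 0, 0, 1, dims) - offset_nhwc(0, 0, 0, 0, dims)
print(step_nchw, step_nhwc)  # 1 3
```

In NCHW, neighboring pixels of a channel sit next to each other in memory. In NHWC, they are C elements apart because the channel values of each pixel are interleaved.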

NUMA Controls Affecting Performance

  • Recommended settings: --cpunodebind=0 --membind=0
  • Usage (shell):
    numactl --cpunodebind=0 --membind=0 python script.py \
      --num_intra_threads=cores --num_inter_threads=2 --mkl=True \
      --data_format=NCHW

Running on a NUMA-enabled machine brings with it special considerations. NUMA, or non-uniform memory access, is a memory layout design used in data center machines meant to take advantage of locality of memory in multi-socket machines with multiple memory controllers and blocks. The Intel Optimization for TensorFlow runs best when confining both the execution and memory usage to a single NUMA node.
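If you launch runs from a Python driver rather than by hand, the same binding can be assembled programmatically. The sketch below only builds and prints the command line. The helper name is hypothetical, script.py and its flags are taken from the usage lines in this article, and writing data_format as a --data_format flag is an assumption about how the script parses its arguments.

```python
import shlex


def numactl_command(script, physical_cores, node=0, extra_args=()):
    """Build an argv list that pins execution and memory to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={node}",
        f"--membind={node}",
        "python",
        script,
        f"--num_intra_threads={physical_cores}",
        "--num_inter_threads=2",
        "--mkl=True",
        "--data_format=NCHW",  # assumption: the script accepts this as a flag
        *extra_args,
    ]


cmd = numactl_command("script.py", physical_cores=28)
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting problems when it is handed to subprocess.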

Intel MKL-DNN Technical Performance Considerations

The library takes advantage of SIMD instructions through vectorization, as well as multiple cores through multithreading. Vectorization effectively utilizes cache and the latest instruction sets. On modern Intel processors, a single core can perform up to two fused multiply and add (FMA) operations on 16 single-precision or 64 int8 numbers per cycle. Moreover, multithreading helps in performing multiple independent operations simultaneously. Since deep learning tasks are often independent, getting available cores working in parallel is an obvious choice to boost performance.
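As a back-of-the-envelope illustration of what those per-cycle figures imply, the sketch below turns them into a theoretical peak. The function name is hypothetical, and the 28 cores at 2.5 GHz are taken from the configuration notes as an example only. Real clock rates under heavy vector load may differ, so treat the result as an upper bound rather than an expected measurement.

```python
def peak_ops_per_second(cores, ghz, lanes=16, fma_per_cycle=2):
    """Theoretical peak: each FMA counts as 2 operations (multiply + add)."""
    ops_per_cycle_per_core = fma_per_cycle * lanes * 2
    return cores * ghz * 1e9 * ops_per_cycle_per_core


# fp32: 2 FMAs x 16 lanes x 2 ops = 64 operations per cycle per core.
fp32_peak = peak_ops_per_second(cores=28, ghz=2.5)
print(f"fp32 peak: {fp32_peak / 1e12:.2f} TFLOP/s")  # fp32 peak: 4.48 TFLOP/s

# int8, using the 64-lane figure quoted above.
int8_peak = peak_ops_per_second(cores=28, ghz=2.5, lanes=64)
print(f"int8 peak: {int8_peak / 1e12:.2f} TOP/s")  # int8 peak: 17.92 TOP/s
```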

To achieve the best possible CPU utilization, Intel MKL-DNN may use hardware-specific buffer layouts for compute-intensive operations, including convolution and inner product. All the other operations will run on the buffers in hardware-specific layouts or common layouts used by frameworks.

Intel MKL-DNN uses OpenMP to express parallelism. OpenMP is controlled by various environment variables: KMP_AFFINITY, KMP_BLOCKTIME, OMP_NUM_THREADS, and KMP_SETTINGS. These environment variables will be described in detail in the following sections. Changing the values of these environment variables affects performance of the framework, so we highly recommend that users tune them for their specific neural network model and platform.

KMP_AFFINITY

  • Recommended settings:
    KMP_AFFINITY=granularity=fine,verbose,compact,1,0
  • Usage (shell):
    numactl --cpunodebind=0 --membind=0 python script.py \
      --num_intra_threads=cores --num_inter_threads=2 --mkl=True \
      --data_format=NCHW --kmp_affinity=granularity=fine,verbose,compact,1,0

KMP_AFFINITY is used to restrict execution of certain threads to a subset of the physical processing units in a multiprocessor computer. Set this environment variable as follows:

  KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
  • Modifier is a string consisting of a keyword and specifier.
  • Type is a string indicating the thread affinity to use.
  • Permute is a positive integer value that controls which levels are most significant when sorting the machine topology map. The value forces the mappings to make the specified number of most significant levels of the sort the least significant, and then inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations.
  • Offset is a positive integer value that indicates the starting position for thread assignment.
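A short sketch of how the pieces of this format fit together is shown below. The helper is hypothetical and only assembles the string. It does not validate keywords against the OpenMP runtime, and requiring permute whenever offset is given is a simplifying choice made here because the two values are positional.

```python
def build_kmp_affinity(affinity_type, modifiers=(), permute=None, offset=None):
    """Assemble [<modifier>,...]<type>[,<permute>][,<offset>] into one string."""
    if offset is not None and permute is None:
        # Simplification: treat the trailing integers as strictly positional.
        raise ValueError("offset requires permute to be given as well")
    parts = list(modifiers) + [affinity_type]
    if permute is not None:
        parts.append(str(permute))
    if offset is not None:
        parts.append(str(offset))
    return ",".join(parts)


value = build_kmp_affinity(
    "compact",
    modifiers=["granularity=fine", "verbose"],
    permute=1,
    offset=0,
)
print("KMP_AFFINITY=" + value)
# KMP_AFFINITY=granularity=fine,verbose,compact,1,0
```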

We’ll use the recommended setting of KMP_AFFINITY as an example to explain basic content of this environment variable:

  KMP_AFFINITY=granularity=fine,verbose,compact,1,0

The modifier is granularity=fine,verbose. The word fine causes each OpenMP thread to be bound to a single thread context, and verbose prints messages concerning the supported affinity, e.g.,

  • The number of packages
  • The number of cores in each package
  • The number of thread contexts for each core
  • OpenMP thread bindings to physical thread contexts

The word compact is the value of type. It assigns OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was placed.

Figure 5 shows the machine topology map when KMP_AFFINITY is set to these values. The OpenMP thread <n>+1 is bound to a thread context as closely as possible to the OpenMP thread <n>, but on a different core. Once each core has been assigned an OpenMP thread, the subsequent OpenMP threads are assigned to the available cores in the same order, but they are assigned on different thread contexts.

Figure 5 – Machine topology map with the setting KMP_AFFINITY=granularity=fine,compact,1,0

The advantage of this setting is that consecutive threads are bound close together so that communication overhead, cache line invalidation overhead, and page thrashing are minimized. It’s desirable to avoid binding multiple threads to the same core while leaving other cores unused. For a more detailed description of KMP_AFFINITY, see the Intel® C++ Compiler Developer Guide and Reference.
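The placement order described for Figure 5 can be imitated with a toy model. The function below is a hypothetical simplification for a single package. It is not the OpenMP runtime's actual algorithm, and it only reproduces the pattern described above: fill one thread context on every core first, then move on to the next thread context.

```python
def compact_permute1_placement(num_threads, cores, contexts_per_core):
    """Return (core, thread_context) for each OpenMP thread, in thread order."""
    slots = cores * contexts_per_core
    placement = []
    for thread in range(num_threads):
        slot = thread % slots  # wrap around if threads outnumber contexts
        placement.append((slot % cores, slot // cores))
    return placement


# 4 cores with 2 thread contexts each, 6 OpenMP threads (illustrative sizes).
for thread, (core, context) in enumerate(compact_permute1_placement(6, 4, 2)):
    print(f"OpenMP thread {thread} -> core {core}, thread context {context}")
```

Threads 0 to 3 each get their own core, and threads 4 and 5 land on the second thread context of cores 0 and 1. This is why setting the thread count to the number of physical cores keeps every thread on a separate core.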

KMP_BLOCKTIME

  • Recommended settings for CNN: KMP_BLOCKTIME=0
  • Recommended settings for non-CNN: KMP_BLOCKTIME=1 (user should verify empirically)
  • Usage (shell):
    numactl --cpunodebind=0 --membind=0 python script.py \
      --num_intra_threads=cores --num_inter_threads=2 --mkl=True \
      --data_format=NCHW --kmp_affinity=granularity=fine,verbose,compact,1,0 \
      --kmp_blocktime=0
    (use --kmp_blocktime=1 for non-CNN models)

This environment variable sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before going to sleep. The default value is 200 ms.

After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Sleeping allows the threads to be used, until more parallel work becomes available, by non-OpenMP threaded code that may execute between parallel regions, or by other applications. A small KMP_BLOCKTIME value may offer better overall performance if the application contains non-OpenMP threaded code that executes between parallel regions. A larger KMP_BLOCKTIME value may be more appropriate if threads are to be reserved solely for OpenMP execution, but may penalize other concurrently-running OpenMP or threaded applications. The suggested setting is 0 for CNN-based models.

OMP_NUM_THREADS

  • Recommended settings for CNN: OMP_NUM_THREADS = # physical cores
  • Usage (shell): export OMP_NUM_THREADS=<number of physical cores>

This environment variable sets the maximum number of threads to use in OpenMP parallel regions if no other value is specified in the application. The value can be a single integer, in which case it specifies the number of threads for all parallel regions, or a comma-separated list of integers, in which case each integer specifies the number of threads for a parallel region at each nesting level. The first position in the list represents the outermost parallel nesting level. The default value is the number of logical processors visible to the operating system on which the program is executed. The recommended value equals the number of physical cores.
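To illustrate how a value for this variable maps to nesting levels, here is a small sketch that parses it. The function is a hypothetical helper for reading the value, not something the OpenMP runtime exposes.

```python
def parse_omp_num_threads(value):
    """Split an OMP_NUM_THREADS value into thread counts per nesting level."""
    levels = [int(part) for part in value.split(",")]
    if any(count <= 0 for count in levels):
        raise ValueError("thread counts must be positive integers")
    return levels


print(parse_omp_num_threads("28"))   # [28]
print(parse_omp_num_threads("4,2"))  # [4, 2]
```

The first entry is the thread count for the outermost parallel level. A single value, such as the recommended number of physical cores, yields a one-element list.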

KMP_SETTINGS

  • Usage (shell): export KMP_SETTINGS=TRUE

This environment variable enables (TRUE) or disables (FALSE) the printing of OpenMP runtime library environment settings during program execution.
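Pulling the four variables together, a launcher script might prepare them in one place. The sketch below is one possible arrangement under stated assumptions: the helper names are hypothetical, the values are the recommendations from this article, and the variables are assumed to take effect only if they are set before the framework initializes its OpenMP runtime.

```python
import os


def openmp_env(physical_cores, cnn=True, print_settings=True):
    """Return the recommended OpenMP environment variables as strings."""
    return {
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
        "KMP_BLOCKTIME": "0" if cnn else "1",  # verify empirically for non-CNN
        "OMP_NUM_THREADS": str(physical_cores),
        "KMP_SETTINGS": "TRUE" if print_settings else "FALSE",
    }


def apply_openmp_env(physical_cores, cnn=True, print_settings=True):
    """Export the settings into this process before importing the framework."""
    os.environ.update(openmp_env(physical_cores, cnn, print_settings))


for name, value in sorted(openmp_env(28).items()):
    print(f"export {name}={value}")
# In a real script: call apply_openmp_env(28) first, then import tensorflow.
```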

Learn More

Start using Intel® optimized frameworks to accelerate your deep learning workloads on the CPU today. Check out our helpful resources on www.intel.ai and get support from the Intel® AI Developer Forum. Also, visit the new Intel® AI Model Zoo for solution-oriented resources for your accelerated TensorFlow* projects. Use these resources and you can have confidence that you’re using your CPU resources to their fullest capability.

References

Configuration Note 1

INFERENCE using FP32 Batch Size Caffe GoogleNet v1 128 AlexNet 256.

Configurations for inference throughput: Tested by Intel as of 6/7/2018. Platform: two-socket Intel® Xeon® Platinum processor 8180 CPU @ 2.50GHz / 28 cores, HT ON, Turbo ON. Total Memory: 376.28GB (12 slots / 32 GB / 2666 MHz), four instances of the framework, CentOS Linux*-7.3.1611-Core, SSD sda RS3WC080 HDD 744.1GB, sdb RS3WC080 HDD 1.5TB, sdc RS3WC080 HDD 5.5TB, Deep Learning Framework Caffe version: a3d5b022fe026e9092fc7abc7654b1162ab9940d. Topology: GoogleNet* v1. BIOS: SE5C620.86B.00.01.0004.071220170215. MKLDNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396 NoDataLayer. Measured: 1449 imgs/sec vs. Tested by Intel as of 06/15/2018. Platform: 2S Intel® Xeon® processor CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 64GB DDR4-2133 ECC RAM. BIOS: SE5C610.86B.01.01.0024.021320181901, CentOS Linux-7.5.1804 (Core) kernel 3.10.0-862.3.2.el7.x86_64, SSD sdb INTEL SSDSC2BW24 SSD 223.6GB. Framework BVLC-Caffe: https://github.com/BVLC/caffe, Inference and training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594.

Configuration for training throughput: Tested by Intel as of 05/29/2018. Platform: two-socket Intel Xeon Platinum processor 8180 CPU @ 2.50GHz / 28 cores, HT ON, Turbo ON. Total Memory: 376.28GB (12 slots / 32 GB / 2666 MHz), four instances of the framework, CentOS Linux-7.3.1611-Core, SSD sda RS3WC080 HDD 744.1GB, sdb RS3WC080 HDD 1.5TB, sdc RS3WC080 HDD 5.5TB, Deep Learning Framework Caffe version: a3d5b022fe026e9092fc7abc765b1162ab9940d. Topology: alexnet. BIOS: SE5C620.86B.00.01.0004.071220170215. MKLDNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396 NoDataLayer. Measured: 1257 imgs/sec vs. Tested by Intel as of 06/15/2018. Platform: 2S Intel® Xeon® processor CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 64GB DDR4-2133 ECC RAM. BIOS: SE5C610.86B.01.01.0024.021320181901, CentOS Linux-7.5.1804 (Core) kernel 3.10.0-862.3.2.el7.x86_64, SSD sdb INTEL SSDSC2BW24 SSD 223.6GB. Framework BVLC-Caffe: https://github.com/BVLC/caffe, Inference and training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594.

Configuration Note 2

System configuration: CPU Thread(s) per core: 2 Core(s) per socket: 28 socket(s): 2 NUMA node(s): 2 CPU family: 6 Model: 85 Model name: Intel Xeon Platinum processor 8180 CPU @ 2.50GHz HyperThreading: ON Turbo: ON Memory 376GB (12 x 32GB) 24 slots, 12 occupied 2666 MHz Disks Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB) BIOS SE5C620.86B.00.01.0004. 070920180847 (microcode version 0x200004d) OS Centos Linux 7.4.1708 (Core) Kernel 3.10.0-693.11.6.el7.x86_64 TensorFlowSource: https://github.com/tensorflow/tensorflow commit: 6a0b536a779f485edc25f6a11335b5e640acc8ab MKLDNN version: 4e333787e0d66a1dca1218e99a891d493dbc8ef1 TensorFlow benchmarks: https://github.com/tensorflow/benchmarks

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.