Speeding Up Monte Carlo Simulation with oneAPI
Intel® oneAPI Math Kernel Library (Beta) Data Parallel C++ Usage Models
Intel® oneAPI Math Kernel Library (beta) (oneMKL beta) gives developers and data scientists enhanced math routines for creating science, engineering, or financial applications. You can use it to optimize code for current and future generations of Intel® CPUs and GPUs. In this article, we’ll apply some oneMKL beta Data Parallel C++ (DPC++) usage models to a Monte Carlo (MC) simulation example using the oneMKL beta random number generators (RNG).
We’ll look at five usage models:
- Reference C++
- oneMKL beta DPC++
- Extended oneMKL beta DPC++
- Heterogeneous parallel implementations
- Implementations based on different memory allocation techniques
The bottom line? Minimizing data transfer between the host and device significantly improves performance.
Computing π by Numerical Integration
MC simulations are a broad class of computational algorithms that use repeated, random sampling to obtain numerical results [1]. Figure 1 shows how to compute π using the MC method.
We’ll consider a quadrant inscribed in a unit square. The area of sector S is Area(S) = (1/4)πr² = π/4 (with r = 1), and the area of the square R is Area(R) = 1. If we randomly choose a point c = (x, y) from the unit square R, the probability that c is within sector S is:

P = Area(S)/Area(R) = π/4
We can consider n points, where n is sufficiently large, and count the number k of points falling into S. Then, we can approximate the probability by the ratio k/n, or:

π/4 ≈ k/n
We can approximate π as:

π ≈ 4k/n
According to the law of large numbers, the larger the n, the more accurate our π approximation. More precisely, from the Bernoulli theorem, for any ε > 0:

P(|k/n − π/4| < ε) → 1 as n → ∞
If the x and y coordinates of the tested point c satisfy 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 (abscissa and ordinate, respectively), then c falls into sector S when:

x² + y² ≤ 1

This gives the following algorithm:
- Generate n 2D points where each point is represented by two uniformly distributed random numbers over the [0,1) interval.
- Count the number of points that fall into sector S.
- Calculate the approximate value of π using the previous formula.
Reference C++ Example of π Estimation
Let’s consider a function estimate_pi that takes the number of 2D points, n_points, and performs the computation described above:
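The original listing is not reproduced here; a minimal sketch of such a function, built on the C++11 `<random>` facilities (the mt19937 engine and the seed value 777 are illustrative assumptions, not necessarily the article’s exact choices), might look like this:

```cpp
#include <cstddef>
#include <random>
#include <vector>

float estimate_pi(std::size_t n_points) {
    std::vector<float> x(n_points), y(n_points);

    // Step 1.1: initialize the RNG engine and the distribution.
    std::mt19937 engine(777);  // illustrative seed
    std::uniform_real_distribution<float> distr(0.0f, 1.0f);  // [0, 1)

    // Step 1.2: fill the x and y vectors with random numbers.
    for (std::size_t i = 0; i < n_points; ++i) {
        x[i] = distr(engine);
        y[i] = distr(engine);
    }

    // Step 2: count the points that fall inside the unit quadrant.
    std::size_t n_under_curve = 0;
    for (std::size_t i = 0; i < n_points; ++i)
        if (x[i] * x[i] + y[i] * y[i] <= 1.0f) ++n_under_curve;

    // Step 3: estimate pi as 4k/n.
    return 4.0f * static_cast<float>(n_under_curve) / static_cast<float>(n_points);
}
```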
This example uses RNG functionality from the C++11 standard. In Step 1, we initialize the RNG by creating two instances: engine and distr. The engine holds the state of a generator and provides independent, identically distributed random variables. distr describes how the generator output is transformed according to the distribution’s statistics and parameters. In this example, uniform_real_distribution produces random floating-point values, uniformly distributed in the interval [a, b).
In Step 1.2, random numbers are obtained by passing the engine to the distribution’s operator(), which returns a single floating-point value per call. The loop fills vectors x and y with random numbers. In Step 2, each (x, y) position is checked to determine how many points fall into sector S. The result is stored in the n_under_curve variable. Finally, the estimated π value is calculated and returned to the main program (Step 3).
oneMKL Beta DPC++ Example of π Estimation
Now let’s modify the previous estimate_pi function and add cl::sycl::queue. DPC++ lets you choose a device to run on using a selector interface:
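A minimal sketch of the selector-based queue setup, using the beta-era `cl::sycl` naming that appears throughout the article (exact header and selector names may differ in later releases):

```cpp
#include <CL/sycl.hpp>

// Pick a device at queue-construction time. Swapping the selector is the
// only change needed to retarget the code (host, CPU, or GPU).
cl::sycl::gpu_selector selector;   // or cl::sycl::cpu_selector / host_selector
cl::sycl::queue queue(selector);
```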
cl::sycl::queue is an input argument to oneMKL beta functions. The library’s kernels are submitted to this queue, with no code changes required to switch between devices (i.e., host or accelerator). CPU and GPU devices are supported in the oneMKL beta release.
Instead of std::vector, cl::sycl::buffer is used to store the random numbers:
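A sketch of the buffer declarations (the names x_buf and y_buf are illustrative assumptions):

```cpp
// 1-D buffers replace std::vector as storage for the random numbers.
cl::sycl::buffer<float, 1> x_buf(cl::sycl::range<1>(n_points));
cl::sycl::buffer<float, 1> y_buf(cl::sycl::range<1>(n_points));
```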
Buffers manage data transfer between the host application and device kernels. The buffer and accessor classes are responsible for tracking memory transfers and guaranteeing data consistency across the different kernels.
In Step 1, the oneMKL beta RNG API likewise initializes two entities, the RNG engine and the random number distribution (Step 1.1). engine takes cl::sycl::queue and an initial SEED value as constructor inputs. The distribution mkl::rng::uniform has template parameters for the type of output values and the method used to transform the engine’s output (see the Intel® oneAPI Math Kernel Library Data Parallel C++ Developer Reference for details), plus the distribution parameters.
The mkl::rng::generate function is called at Step 1.2 to obtain random numbers. This function takes the distribution and engine created in the previous step, the number of elements to be generated, and storage for the result in buffers. The oneMKL beta RNG API mkl::rng::generate() is vectorized. Vector library subroutines often perform better than scalar routines. (See Intel® oneMKL Vector Statistics Notes for details.)
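Steps 1.1 and 1.2 can be sketched as follows (beta-era namespaces and default template parameters are assumed; consult the Developer Reference for exact signatures):

```cpp
// Step 1.1: engine bound to the queue, seeded with SEED.
mkl::rng::philox4x32x10 engine(queue, SEED);
// Single-precision uniform distribution on [0, 1).
mkl::rng::uniform<float> distr(0.0f, 1.0f);

// Step 1.2: vectorized generation of n_points values into each buffer.
mkl::rng::generate(distr, engine, n_points, x_buf);
mkl::rng::generate(distr, engine, n_points, y_buf);
```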
In Step 2, the random numbers are postprocessed on the host. Host accessors for buffers are created to access the data:
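A sketch of the accessor creation (accessor names are illustrative):

```cpp
// Host accessors give the host read access to the buffers and trigger the
// device-to-host data transfer.
auto x_acc = x_buf.get_access<cl::sycl::access::mode::read>();
auto y_acc = y_buf.get_access<cl::sycl::access::mode::read>();
```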
Other steps are the same as in the C++ reference example.
Extended oneMKL Beta DPC++ Example of π Estimation
We can optimize Step 2 in the previous example using the DPC++ Parallel STL function. This approach reduces data transfer between the host and device, which improves performance. The modification to Step 2 is as follows:
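A sketch of the device-side count, using the beta-era Parallel STL (dpstd) names; current oneDPL releases use the oneapi::dpl namespace instead, so treat the identifiers below as assumptions:

```cpp
// Count points inside the quadrant directly on the device.
auto policy = dpstd::execution::make_device_policy(queue);
std::size_t n_under_curve = std::count_if(
    policy,
    dpstd::make_zip_iterator(dpstd::begin(x_buf), dpstd::begin(y_buf)),
    dpstd::make_zip_iterator(dpstd::end(x_buf), dpstd::end(y_buf)),
    [](auto point) {
        float x = std::get<0>(point);
        float y = std::get<1>(point);
        return x * x + y * y <= 1.0f;
    });
```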
The zip iterator provides a pair of random numbers as input to the count_if function. Other steps are the same as in the previous oneMKL beta DPC++ example.
Unified Shared Memory (USM)-Based Example of oneMKL Beta DPC++ π Estimation
There are two approaches to DPC++ pointer-based memory management: the cl::sycl::malloc functions and cl::sycl::usm_allocator (see the DPC++ Unified Shared Memory documentation for more details).
cl::sycl::malloc allows you to allocate memory directly on the host or device, or to access memory from both the host and device:
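A sketch of shared USM allocation (function names follow the beta-era cl::sycl interface):

```cpp
// Shared allocation: the memory is accessible from both host and device.
float* x = static_cast<float*>(cl::sycl::malloc_shared(
    n_points * sizeof(float), queue.get_device(), queue.get_context()));
float* y = static_cast<float*>(cl::sycl::malloc_shared(
    n_points * sizeof(float), queue.get_device(), queue.get_context()));

// ... generate and process the points ...

// USM memory must be freed explicitly.
cl::sycl::free(x, queue.get_context());
cl::sycl::free(y, queue.get_context());
```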
With this kind of allocation, you also get all the advantages of pointer arithmetic. However, this approach requires you to explicitly free any allocated memory.
The cl::sycl::usm_allocator approach allows you to work with standard or user containers without worrying about direct memory control:
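A sketch of the allocator-backed container approach (constructor argument order follows the beta-era interface and may differ in later releases):

```cpp
// An allocator-backed std::vector hides the USM management behind the container.
cl::sycl::usm_allocator<float, cl::sycl::usm::alloc::shared>
    alloc(queue.get_context(), queue.get_device());
std::vector<float, decltype(alloc)> x(n_points, alloc);
std::vector<float, decltype(alloc)> y(n_points, alloc);
```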
Generation is performed in the same way, but instead of cl::sycl::buffer<float, 1>, float* is used:
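A sketch of the USM-based generation calls (event variable names are illustrative):

```cpp
// Generation writes straight into USM memory; the returned events allow
// explicit synchronization before the host reads the data.
auto event_x = mkl::rng::generate(distr, engine, n_points, x);
auto event_y = mkl::rng::generate(distr, engine, n_points, y);
event_x.wait();
event_y.wait();   // or queue.wait() to drain the whole queue
```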
Each generate function returns a cl::sycl::event that can be used for synchronization: call wait() or wait_and_throw() on these events, or on the entire queue. This allows manual control of data dependencies between DPC++ kernels.
To obtain random numbers and to count points in Step 2, you can use vectors/pointers natively without creating host accessors:
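With USM, the Step 2 post-processing can read the results straight through the raw pointers. A minimal, plain-C++ sketch (the function name count_under_curve is an assumption):

```cpp
#include <cstddef>

// Count how many (x, y) points fall inside the unit quadrant; with USM,
// the host dereferences the generated data directly, no accessors needed.
std::size_t count_under_curve(const float* x, const float* y, std::size_t n) {
    std::size_t k = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (x[i] * x[i] + y[i] * y[i] <= 1.0f) ++k;
    return k;
}
```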
Heterogeneous Execution of oneMKL Beta DPC++ π Estimation
We can modify the previous examples to offload part of the computation to an accelerator instead of doing the entire computation on the host. The APIs stay the same. You just need to choose the device for cl::sycl::queue. Two queues are needed for parallel execution on different devices:
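A sketch of the two-queue setup (queue names are illustrative):

```cpp
// One queue per device; work submitted to each can run concurrently.
cl::sycl::queue cpu_queue(cl::sycl::cpu_selector{});
cl::sycl::queue gpu_queue(cl::sycl::gpu_selector{});
```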
You can also allocate memory separately. CPU allocation can be done directly on the host:
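A sketch of the split allocation, assuming the points are divided into hypothetical n_cpu and n_gpu shares:

```cpp
// The CPU share of the points can live in ordinary host memory...
std::vector<float> x_cpu(n_cpu), y_cpu(n_cpu);

// ...while the GPU share uses a USM allocation tied to the GPU queue.
float* x_gpu = static_cast<float*>(cl::sycl::malloc_shared(
    n_gpu * sizeof(float), gpu_queue.get_device(), gpu_queue.get_context()));
float* y_gpu = static_cast<float*>(cl::sycl::malloc_shared(
    n_gpu * sizeof(float), gpu_queue.get_device(), gpu_queue.get_context()));
```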
oneMKL beta RNG engines should be constructed for the exact type of device, so two engine objects are required. The second engine may be initialized with a different seed, or it may continue the same sequence with an offset of n_points. (See the RNGs in parallel computation section of the Intel® MKL Vector Statistics Notes for details.)
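A sketch of the two-engine setup using the offset approach (the skip-ahead call and the n_cpu split are assumptions; check the Developer Reference for the exact routine):

```cpp
// Separate engines per device, both starting from the same seed.
mkl::rng::philox4x32x10 cpu_engine(cpu_queue, SEED);
mkl::rng::philox4x32x10 gpu_engine(gpu_queue, SEED);

// Offset the GPU engine so the two devices draw non-overlapping
// subsequences; each point consumes two numbers (x and y).
mkl::rng::skip_ahead(gpu_engine, 2 * n_cpu);
```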
Generation and postprocessing may be called as in the previous examples. In this example, std::count_if is used on a host with a parallel execution policy:
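A sketch of the host-side count with a C++17 parallel execution policy (the index-based formulation and the helper name count_cpu are assumptions; linking against a parallel-algorithms backend such as TBB may be required):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Post-process the CPU share in parallel on the host: iterate over indices
// so x_cpu and y_cpu can be tested pairwise.
std::size_t count_cpu(const std::vector<float>& x_cpu,
                      const std::vector<float>& y_cpu) {
    std::vector<std::size_t> idx(x_cpu.size());
    std::iota(idx.begin(), idx.end(), 0);
    return std::count_if(std::execution::par, idx.begin(), idx.end(),
        [&](std::size_t i) {
            return x_cpu[i] * x_cpu[i] + y_cpu[i] * y_cpu[i] <= 1.0f;
        });
}
```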
This approach lets you balance the workload between devices and/or complete different tasks in parallel on any supported devices, all within a single API.
Figure 2 compares the oneMKL beta DPC++ and Extended oneMKL beta DPC++ examples.
Hardware and software parameters:
- Hardware: Intel® Core™ i9-9900K processor @ 3.60GHz, Intel® Gen12LP HD Graphics NEO
- Operating system: Ubuntu* 18.04.2 LTS
- Software: oneMKL beta
- Number of generated 2D points: 10⁸
- Random number engine: mkl::rng::philox4x32x10
- Random number distribution: Uniform single-precision
- Measurement region: Computational part of estimate_pi() functions (without memory allocation overhead)
Speeding Up Performance
We’ve demonstrated different oneMKL beta DPC++ usage models applied to π estimation by MC simulations. Slightly modifying the reference C++ example will let you use DPC++ features and oneMKL beta functions to execute code on the different supported devices, including heterogeneous execution. Reducing data transfer between the host and device can significantly improve performance, as shown in Figure 2.
1. Knuth, Donald E. The Art of Computer Programming, Volume 2, Seminumerical Algorithms, 2nd edition, Addison-Wesley Publishing Company, Reading, Massachusetts, 1981.