A New Approach to Parallel Computing Using Automatic Differentiation
Getting Top Performance on Modern Multicore Systems
If you’re interested in high-performance computing, high-level, object-oriented languages aren’t the first things that come to mind. Object abstractions come with a runtime penalty and are often difficult for compilers to vectorize. Adapting your code for multithreading execution is a huge challenge, and the resulting code is often a headache to maintain.
You’re in luck if performance-critical parts of your code are localized and can be flattened and safely parallelized. However, many performance-critical problems can benefit from object-oriented programming abstractions. We’re proposing a different programming model that lets you achieve top performance on single instruction multiple data (SIMD), non-uniform memory access (NUMA) multicore systems.
Operator Overloading for Valuation Graph Extraction
We’ll focus on problems where the same function, F(1), needs to be executed on a data set X[i]. For example, let’s look at Monte Carlo simulations in the finance world where X[i] is a random sample and F(.) is a pricing function (Figure 1). We use an operator overloading pattern to extract all primitive operations performed by F(.).
This pattern is very common in automatic adjoint differentiation (AAD) libraries. Unlike traditional AAD libraries, we don’t build a data structure to represent the valuation graph. Instead, we compile binary machine code instructions to replicate valuations as defined in the graph, which can be seen as a just-in-time (JIT) compilation. However, we don’t work with the source code directly. Instead, we compile a valuation graph produced by the user’s algorithm. Since we want to apply F(.) to a large set of data points, we can compile this code to expand all scalar operations to full SIMD vector operations and process four (AVX2) or eight (AVX-512) data samples in parallel.
Learn by Example
Let’s look at a simple option pricing framework where we use various abstract business objects. In this example, we simulate asset values as a random process:
The classes BankRate and AssetVolatility can define different ways of computing model parameters, and implementation can be done deep in the derived classes. This function can be used with the native double type. When applied along timepoints, t[i] can be used to simulate asset value at the option expiry:
However, this leads to bad performance because the compiler can’t effectively vectorize the code and business objects may contain virtual function calls. Using the AAD runtime compiler, we can execute the function, record one random path of asset evolution, and compute option intrinsic value at the expiry:
At this stage, the func object contains compiled, vectorized machine code that replicates valuations to produce the final payoff output value given the arbitrary random_samples vector as input. The function object remains constant after recording and requires memory context for execution:
Making efficient and safe multithreaded code can be difficult. Notice that the recording happens only for one input sample and can be executed in the controlled, stable, single-threaded environment. The resulting recorded function, however, is threadsafe and only needs separate workspace memory allocated for each thread. This is a very attractive property, since it lets us turn non-multi-thread-safe code into something that can be safely executed on multicore systems. Even optimal NUMA memory allocation becomes a trivial task. (You can view the full code listing for the multithreaded example here.)
This technique not only accelerates your function, it can also create an adjoint function to compute derivatives of all inputs with respect to all outputs. This is similar to the back-propagation algorithm used for deep neural network (DNN) training. Unlike DNN training libraries, this approach works for almost any arbitrary C++ code. To record an adjoint function, simply mark which input variables are required for differentiation:
Finally, to execute the adjoint function, initialize the gradient values of outputs and call the reverse() method on the function object:
Getting Top Performance
Hardware is evolving toward increasing parallelism with a lot more cores, wider vector registers, and accelerators. For object-oriented programmers, it’s hard to adapt single-threaded code to existing parallel methods like OpenMP and CUDA. Using the AADC tool from Matlogica, programmers can turn their object-oriented, single-threaded, scalar code into AVX2/AVX512 vectorized, multithreaded, and threadsafe lambda functions. Crucially, the AADC tool can also generate a lambda function for the Adjoint method of computing, with all required derivatives using the same interface. Visit Matlogica for more details and a demo version of AAD-C.
Evgeny Lakshtanov is partially supported by Portuguese funds through the Center for Research and Development in Mathematics and Applications (CIDMA) and the Portuguese Foundation for Science and Technology (FCT, Fundação para a Ciência e a Tecnologia), within project UIDP/04106/2020.