# Boosting Java* Performance in Big Data Applications

## How New Enhancements Enable Faster and Better Numerical Computing

Since its introduction more than two decades ago, Java* has become a popular enterprise language. Embraced initially for its write-once, run-anywhere property, its usage increased due to the relative ease of building new applications by integrating existing Java projects. The growth of several open-source Java application stacks such as Apache Hadoop* has added to this growth. Fundamentally, developers view Java as enhancing their productivity.

Java programmers rely on the Java Virtual Machine* (JVM*) to deliver on the write-once, run-anywhere promise. The JVM also delivers high performance. By contrast, C and C++ developers rely on the compiler. For these languages, compilation is not in the critical path to runtime performance and happens relatively rarely, so the compiler can do deep analysis and provide esoteric optimizations. The JVM, on the other hand, uses a just-in-time (JIT) compilation model. Compilation is dynamic and happens during every run, so it needs to be lightweight—yet as sophisticated as possible. A lot of work has gone into optimizing the JIT compiler over the years, and for general business processing, the performance of Java has been close to that provided by C or C++.

Today’s business applications―for example, analytics and deep learning—require extensive numerical computing. These use-cases all consume enormous amounts of data while doing large amounts of matrix math. Java applications are a natural fit for marshalling and delivering the data. It’s not an exaggeration to say that big data runs on the JVM. However, the heavy use of floating-point math is more naturally in the domain of hand-tuned kernels written in C or assembly. Improving the Java ecosystem to handle the numerical computing will allow developers to be more productive.

Big data applications, distributed deep learning programs, and artificial intelligence solutions can run directly on top of existing Apache Spark* or Hadoop clusters, and can benefit from efficient scale-out. Machine learning algorithms running on top of the Spark framework, and the current revolution of data science on the JVM, push the need for enhanced single instruction, multiple data (SIMD) support in Java. SIMD support would open up ways to explore new opportunities in areas like high-performance computing (HPC), linear algebra-based machine learning (ML) algorithms, deep learning (DL), and artificial intelligence (AI).

To illustrate, let’s consider the new Math FMA API available in **Open JDK9*** and explore its Java performance in a few ML algorithms on the **Intel® Advanced Vector Instructions (Intel® AVX)**-enabled **Intel® Xeon Phi™ processors**.

### Fused Multiply Add (FMA) Operations

Modern Intel® processors include Intel AVX instructions for performing SIMD operations, including FMA operations (A=A*B+C). FMA is highly valuable for linear algebra-based ML algorithms, deep learning and neural networks (dot product, matrix multiplication), financial and statistical computation models, and polynomial evaluations. FMA instructions perform vectored a*b+c operations on IEEE-754-2008 floating-point values, where the a*b multiplications are performed with infinite precision. The final results of the addition are rounded to produce the desired precision. The FMA instructions have been available in second-generation and later **Intel® Core™ processors**. To leverage them in SIMD operations, the JVM JIT compiler needs to map FMA operations written in Java to further Intel AVX FMA extensions, if available, on the underlying CPU platform.

### FMA API Available in Java 9

Open JDK9 provides the FMA API for Java developers as part of the java.lang.math package. The open JDK9 release includes compiler intrinsics for Intel AVX FMA extensions, which map FMA Java routines to CPU instructions directly on modern CPUs (e.g., Intel Xeon Phi and **Intel® Xeon® Platinum 8180 processors**). This doesn’t require any additional work from the developer. However, Java algorithms designed to use FMA instructions should take into consideration that a non-FMA sequence of packed floating-point multiply and add instructions likely will produce slightly different results compared to FMA. When formulating convergence criteria, it’s important to factor in the difference in the precision of intermediate results to avoid surprises in the final results. In Java, when a*b+c is evaluated as a regular floating-point expression, two rounding errors are involved: the first for the multiply operation, the second for the addition operation.

In Java, FMA operations are supported via the Math FMA API. The FMA routine returns the FMA of the three arguments. It returns the exact product of the first two arguments, summed with the third argument, and then rounded once to the nearest double-precision value. The FMA operation is performed using the java.math. BigDecimal class. Infinity and NaN arithmetic input values aren’t supported BigDecimal; these inputs are handled with two roundings for computation of correct results.

The FMA API takes floating-point inputs a, b, and c and returns floating-point type. It supports both single- and double-precision. The FMA computation is performed in double-precision (discussed later). The implementation first screens for and handles non-finite input values whose arithmetic is not supported by BigDecimal. If all inputs are within finite range, the following expression computes the product of floating-point inputs a and b by explicitly casting them into BigDecimal objects:

BigDecimal product = (new BigDecimal (a)).multiply (new BigDecimal (b));

In special cases, where a third floating-point input c is zero, the API carefully handles the sign of a zero in case the product (a*b) is also zero. The sign for the zero final result is computed by the following floating-point expression:

if (a == 0.0 || b == 0.0) { return a * b + c; }

In cases where c is zero and the product is nonzero, the doubleValue () of the BigDecimal product is returned:

else { return product.doubleValue ( ); }

For all nonzero inputs a, b, and c:

else { return product.add (new BigDecimal (c)). doubleValue ( ); }

The following example shows the A[i]=A[i]+C*B[i] operation using the FMA API available in Open JDK9:

for (int i = 0; i < A.length && i < B.length; i++) A[i] = Math.fma(C, B[i], A[i]);

FMA for both single-precision and double-precision floats is computed in the double format, and then explicitly stored as a float or double to match the return type. Since double has more than twice the precision of the float format, the multiplication of a*b is exact. Addition of c to the product incurs one rounding error. Moreover, since double has more than (2p+2) precision bits compared to the p bits of float, the two roundings of a*b+c, first to double and then secondarily to float, are equivalent to rounding the intermediate result directly to float. The FMA method calls are further mapped to CPU FMA instructions (if available) for higher performance.

When you’re running your Java applications on the latest Open JDK 9 release and source builds, the JVM enables hardware-based FMA intrinsics for hardware where FMA instructions are available (on Intel processors beginning with second-generation Intel Core processors). FMA intrinsics are generated for the java.lang.Math.fma(a,b,c) methods in JDK9, which calculate value of a*b+c expressions.

The JVM option –XX:+PrintIntrinsics can be used to confirm the FMA intrinsification:

@ 6 java.lang.Math::fma (12 bytes) (intrinsic)

### FMA Performance on BLAS Machine Learning Algorithms

In Java programs, computational kernels can be implemented using the Math FMA, which can further leverage FMA hardware extensions on modern CPUs. On the latest Open JDK 10 source builds, BLAS I DDOT (Dot Product) performance can improve up to 3.5X on Intel Xeon Phi processors using Math.fma. The JVM JIT compiler intrinsifies the FMA method calls into hardware instructions, which are further vectorized via auto-vectorization and super-word optimizations into SIMD operations. BLAS-I DAXPY performance can improve up to 2.3X on Intel Xeon Phi processors using the Math FMA on the latest Open JDK source builds.

The original Java DAXPY implementation is shown below:

int _r = n % 4; int _n = n - _r; for (i = 0; i < _n && i < dx.length; i+=4) { dy[i +dy_off] = dy[i +dy_off] + da*dx[i +dx_off]; dy[i+1 +dy_off] = dy[i+1 +dy_off] + da*dx[i+1 +dx_off]; dy[i+2 +dy_off] = dy[i+2 +dy_off] + da*dx[i+2 +dx_off]; dy[i+3 +dy_off] = dy[i+3 +dy_off] + da*dx[i+3 +dx_off]; } for (i = _n; i < n; i++) dy[i +dy_off] = dy[i +dy_off] + da*dx[i +dx_off]

Here is the Math.fma implementation of the DAXPY algorithm:

for (i = 0; i < dx.length; i++) dy[i] = Math.fma (da, dx[i], dy[i]);

The machine code generated by the JVM JIT compiler is shown below. We see that FMA vector SIMD instructions are generated for the Java algorithm and mapped into the SIMD vfmadd231pd instructions on the Intel Xeon Phi processor for the latest Open JDK source build.

vmovdqu64 0x10(%rbx,%r13,8),%zmm2{%k1}{z} vfmadd231pd 0x10(%r10,%r13,8),%zmm1,%zmm2{%k1}{z} ;*invokestatic fma {reexecute=0 rethrow=0 return_oop=0} ; - SmallDoubleDot::daxpy@146 (line 82

### Applications of the Math FMA Interface

The Math FMA application interface is very useful in programs that perform basic linear algebra computations. Dense DGEMM (double-precision matrix multiplication) innermost compute kernels, for instance, can be rewritten using Math.fma operations as follows:

for (j = 0; j < N; j++) { for (l = 0; l < k; l++) { if (b[j + l * ldb + _b_offset] != 0.0) { temp = alpha * b[j + l * ldb + _b_offset]; for (i = 0; i < m; i++) { c[i + j * Ldc + _c_offset] = Math.fma (temp, a[i + l * lda + _a_offset], c[i + j * Ldc + _c_offset]); } } }

Sparse matrix calculations can be rewritten as follows:

for (int steps = 0; steps < NUM_STEPS; steps++) { for (int l = 0; l < ALPHA; l++) { double Total = 0.0; int rowBegin = Rows[l]; int rowEnd = Rows[l+1]; for (int j=rowBegin; j<rowEnd; j++) { Total = Math.fma (A[Cols[j],Value[j],Total); } y[l] = Total; } }

In Binomial Option Pricing, a common financial algorithm, the operation stepsArray[k]=pdByr*stepsArray[k+1]+puByr*stepsArray[k] can be written using FMA:

void BinomialOptions (double[] stepsArray, int STEPS_CACHE_SIZE, double vsdt, double x, double s, int numSteps, int NUM_STEPS_ROUND, double pdByr, double puByr) { for (int j = 0; j < STEPS_CACHE_SIZE; j++) { double profit = s * Math.exp (vsdt * (2.0D * j - numSteps)) - x; stepsArray[j] = profit > 0.0D? profit: 0.0D; } for (int j = 0; j < numSteps; j++) { for (int k = 0; k < NUM_STEPS_ROUND; ++k) { stepsArray[k] = Math.fma (puByr, stepsArray[k], (pdByr * stepsArray[k+1])); } } }

### Boosting Java Performance

New enhancements to Java enable faster and better numerical computing. The JVM has been improved to use the Intel AVX FMA instructions in the implementation of the Math.fma(). This results in significant performance improvements of matrix multiplications, the basic workhorse of HPC and AI applications.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.