Julia*: A High-Level Language for Supercomputing

The Julia Project Continues to Break New Boundaries in Scientific Computing

High-performance computing (HPC) has been at the forefront of scientific discovery. Scientists routinely run simulations on millions of cores in distributed environments. The key software stack involved has typically been a statically compiled language such as C/C++ or Fortran, in conjunction with OpenMP*/MPI*. This software stack has stood the test of time and is almost ubiquitous in the HPC world. The key reason for this is its efficiency in terms of the ability to use all available compute resources while limiting memory usage. This level of efficiency has always been beyond the scope of more “high-level” scripting languages such as Python*, Octave*, or R*. This is primarily because such languages are designed to be high-productivity environments that facilitate rapid prototyping―and are not designed to run efficiently on large compute clusters. However, there is no technical reason why this should be the case, and this represents a tooling problem for scientific computing.

Julia,* a new language for technical computing, is meant to address this problem. It reads like Python or Octave, but performs as well as C. It has built-in primitives for multithreading and distributed computing, allowing applications to scale to millions of cores.

The first natural questions that spring to mind are: Why can’t the other high-level languages perform this well? What’s unique about Julia that lets it achieve this level of efficiency while still allowing the user write readable code?

The answer is that Julia uses a combination of type inference and code specialization to generate tight machine code. To see how, let’s use the Julia macro @code_native, which generates the assembly code generated by a function call. For example:

julia> @code_native 1 + 1
      .section    __TEXT,__text,regular,pure_instructions
Filename: int.jl
      pushq %rbp
      movq %rsp, %rbp
Source line: 32
      leaq (%rdi,%rsi), %rax
      popq %rbp
      retq
Source line: 32
      nopw (%rax,%rax)

This example shows the assembly code generated when Julia adds two integers. Notice that Julia picks the right assembly instruction for the operation, and there is virtually no overhead.

The next code snippet shows the addition of two double-precision floating-point numbers:

julia> @code_native 1. + 1.
      .section __TEXT,__text,regular,pure_instructions
Filename: float.jl
      pushq %rbp
      movq %rsp, %rbp
Source line: 240
      addsd %xmm1, %xmm0
      popq %rbp
      retq
Source line: 240
      nopw (%rax,%rax)

Julia automatically picks a different instruction, addsd, since we are now adding two floating-point numbers. This is because two versions of the function “+” are generated depending on the type of the input, which was inferred by the compiler. This idea allows the user to write high-level code and then let Julia infer the types of the input and output, and subsequently compile a specialized version of that function for those input types. This is what makes Julia fast (Figure 1).

Figure 1 – Julia benchmark times relative to C (smaller is better). C performance = 1.0. Benchmark code can be found at the GitHub repository

Julia’s unique features solve the two-language problem in data science and numerical computing. This has led to demand for Julia in a variety of industries from aerospace to finance.

This led to the formation of Julia Computing, which consults with industry and fosters adoption of Julia in the community. One of Julia Computing’s flagship products is JuliaPro*, a Julia distribution that includes an IDE, a debugger, and more than 100 tested packages. JuliaPro will soon ship with Intel® Math Kernel Library (Intel® MKL) for accelerated BLAS operations and optimizations for multicore and the latest Intel® processors.

Parallel Computing in Julia

Julia has several built-in primitives for parallel computing at every level: vectorization (SIMD), multithreading, and distributed computing.

Consider an embarrassingly parallel simulation as an example. An iterative calculation of π involves generating random numbers between 0 and 1 and eventually calculating the ratio of the number of points that “land” inside the unit circle as opposed to those that don’t. This gives the ratio of the areas of the unit square and the unit circle, which can then be used to calculate π. The following Julia code makes use of the @parallel construct. The reducer “+” sums the values of the final expression that was computed in parallel on all Julia processes:

addprocs(2)   # Add 2 Julia worker processes

function parallel_π(n)
    in_circle = @parallel (+) for i in 1:n # <-- partition the work
        x = rand()
        y = rand()
        Int((x^2 + y^2) < 1.0)
    end
    return (in_circle/n) * 4.0
end

parallel_π(10000)

Julia offloads work to its worker processes that compute the desired results and send them back to the Julia master, where the reduction is performed. Arbitrary pieces of computation can be assigned to different worker processes through this one-sided communication model.

A more interesting use-case would be having different processes solve, for example, linear equations and then serialize their output and send it to the master process. Scalable machine learning is one such example which involves many independent backsolves. Figure 2 shows the performance of RecSys.jl, which processes millions of movie ratings to make predictions. It is based on an algorithm called Alternating Least Squares (ALS), which is a simple, iterative method for collaborative filtering. Since each solve is independent of every other, this code can be parallelized and scaled easily. Figure 2 is a performance chart that shows the scaling characteristics of the system.

Figure 2 – Scaling chart of the ALS Recommender System: “nprocs” refers to the number of worker processes in the shared memory and distributed setting and number of threads in the multithreaded setting

In Figure 2, we look at the scaling performance of Julia in multithreaded mode (Julia MT), distributed mode (Julia Distr), and shared memory mode (Julia Shared). Shared memory mode allows independent Julia worker processes (as opposed to threads) to access a shared address space. An interesting aspect of this chart is the comparison between Julia’s distributed capabilities and that of Apache Spark*, a popular framework for scalable analytics that is widely adopted in the industry. What’s more interesting, though, are the consequences of this experiment: Julia can scale well to a higher number of nodes. Let’s now move on to a real experiment in supercomputing.

Bringing It All Together (at Scale): Celeste*

The Celeste* Project is a collaboration of Julia Computing (Keno Fischer), Intel Labs (Kiran Pamnany), JuliaLabs@MIT (Andreas Noack, Jarrett Revels), Lawrence Berkeley National Labs (David Schlegel, Rollin Thomas, Prabhat), and University of California–Berkeley (Jeffrey Regier, Maximilian Lam, Steve Howard, Ryan Giordano, Jon McAuliffe). Celeste is a fully generative hierarchical model that uses statistical inference to mathematically locate and characterize light sources in the sky. This model allows astronomers to identify promising galaxies for spectrograph targeting, define galaxies for further exploration, and help understand dark energy, dark matter, and the geometry of the universe. The data set used for the experiment is the Sloan Digital Sky Survey (SDSS) (Figure 3), with more than five million images comprising 55 terabytes of data.

Figure 3 – A sample from the Sloan Digital Sky Survey (SDSS)

Using Julia’s native parallel computing capabilities, researchers were able to scale their application to 8,192 Intel® Xeon® processor cores on the National Energy Research Scientific Computing Center (NERSC) Cori* supercomputer at LBNL. This resulted in parallel speedups of 225x in image analysis, processing more than 20,000 images (or 250 GB), an increase of three orders of magnitude compared to previous iterations.

Note that this was achieved with Julia’s native parallel capabilities, which allowed scaling of up to 256 nodes and four threads per node. This kind of scaling puts Julia firmly in the HPC realm, joining the elite league of languages such as C/C++ and Fortran.

Pradeep Dubey of Intel Labs summarized this nicely: “With Celeste, we are closer to bringing Julia into the conversation because we’ve demonstrated excellent efficiency using hybrid parallelism―not just processes, but threads as well―something that’s still impossible to do with Python or R.”

The next step in the project is scaling even further and processing the entire data set. This will involve making use of the latest Intel® Xeon Phi™ processors and coprocessors.

Breaking New Boundaries

The Julia project has come a long way since its inception in 2012, breaking new boundaries in scientific computing. Julia was conceived as a language that retained the productivity of Python while being as fast and scalable as C/C++. The Celeste Project shows that this approach allows Julia to be the first high-productivity language to scale well on clusters.

The Julia project continues to double its user base every year and to be increasingly adopted in a variety of industries. Such advances in tooling for computational science not only bode well for scientific research to address the challenges of the coming age, but will also result in rapid innovation cycles and turnaround times in the industry.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.