Measuring the Impact of NUMA Migrations on Performance
Weighing the Tradeoffs to Maximize Performance
Modern multi-socket systems typically use non-uniform memory access (NUMA) architectures, in which the cores and the total DRAM are divided among sockets. Each core can access all of memory as a single address space, but accessing memory attached to its own socket is faster than accessing memory attached to a remote socket (hence non-uniform memory access). Because of this difference in latency, access to local-socket memory should be preferred whenever possible.
To achieve this, the Linux* kernel performs NUMA migrations, which try to move memory pages to the sockets from which the data is being accessed. Linux maintains bookkeeping information (such as the number of accesses to each page from a given socket and the latency of those accesses) to make page-migration decisions. NUMA migrations in Linux are enabled by default unless an OS-level NUMA allocation policy is specified using utilities such as numactl.
NUMA page migrations can be very useful in scenarios where multiple applications are running on a single machine, each with its own memory allocation. In such a multi-application scenario, where the system is being shared, it makes sense to move memory pages belonging to a particular application closer to the cores assigned to that application.
In this article, we’ll argue that if a single application is using the entire machine, which is the most common scenario for high-performance applications, NUMA migrations can actually hurt performance. Also, application-level NUMA allocation policies are often preferable to OS-level utilities such as numactl because they give finer control over how individual data structures are allocated and over the design of the allocation policies themselves.
We’ll look at two application-level NUMA allocation policies (Figure 1):
- NUMA interleave, in which memory pages are equally distributed among NUMA sockets in round-robin fashion (similar to the numactl --interleave=all command).
- NUMA blocked, in which equal chunks of the allocated memory are divided among NUMA sockets.
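To make the difference between the two policies concrete, here is a minimal sketch that only models which socket each page would land on. It does not allocate NUMA memory, and the function names are our own illustration rather than part of any library.

```python
def interleave_socket(page: int, num_sockets: int) -> int:
    """NUMA interleave: pages are dealt to sockets round-robin."""
    return page % num_sockets


def blocked_socket(page: int, num_pages: int, num_sockets: int) -> int:
    """NUMA blocked: each socket gets one contiguous chunk of pages.

    Uses ceiling division, so the last socket may receive fewer pages
    when num_pages is not a multiple of num_sockets.
    """
    pages_per_socket = -(-num_pages // num_sockets)  # ceiling division
    return page // pages_per_socket


if __name__ == "__main__":
    num_pages, num_sockets = 8, 4
    # Interleave cycles through sockets 0..3 page by page.
    print([interleave_socket(p, num_sockets) for p in range(num_pages)])
    # Blocked places pages 0-1 on socket 0, pages 2-3 on socket 1, and so on.
    print([blocked_socket(p, num_pages, num_sockets) for p in range(num_pages)])
```

With eight pages and four sockets, interleave yields the placement 0, 1, 2, 3, 0, 1, 2, 3, while blocked yields 0, 0, 1, 1, 2, 2, 3, 3.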
Evaluation on Intel® Xeon® Gold Processors
We’ll evaluate the efficacy of NUMA migrations using a simple microbenchmark that allocates m amount of memory (using both NUMA interleaved and blocked policies) and writes to each location once using t threads such that each thread gets a contiguous block to write sequentially.
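The access pattern of the microbenchmark can be sketched as follows. This is an illustration of the partitioning only: the buffer here is an ordinary Python bytearray rather than NUMA-allocated memory, Python threads will not reproduce the timing behavior of the real benchmark, and the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def thread_ranges(num_elements: int, num_threads: int) -> List[Tuple[int, int]]:
    """Split [0, num_elements) into contiguous, near-equal half-open ranges."""
    base, extra = divmod(num_elements, num_threads)
    ranges, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)  # first `extra` threads get one more
        ranges.append((start, start + size))
        start += size
    return ranges


def write_once(buf: bytearray, num_threads: int) -> None:
    """Each thread writes every location of its own contiguous block once."""

    def work(bounds: Tuple[int, int]) -> None:
        lo, hi = bounds
        for i in range(lo, hi):  # sequential writes within the block
            buf[i] = 1

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() forces completion and surfaces any worker exception.
        list(pool.map(work, thread_ranges(len(buf), num_threads)))


if __name__ == "__main__":
    m, t = 1000, 7  # stand-ins for the article's m bytes and t threads
    data = bytearray(m)
    write_once(data, t)
    print(all(b == 1 for b in data))
```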
The pseudocode for the memory allocation policies and the simple computation is shown in Figure 2 and Figure 3, respectively. The experiments were conducted on a four-socket system with Intel® Xeon® Gold 5120 processors (56 cores with a clock rate of 2.2 GHz and 187GB of DDR4 DRAM). Hyperthreading was disabled during our evaluation.
Effect of NUMA Migration for Different NUMA Allocation Policies
Figure 4 shows the execution time of the microbenchmark using t = 56 threads and interleaved allocation (Figure 1) as the memory allocation size (m) increases. Doubling the workload doubles the execution time, which is expected. However, the number of pages migrated during execution also increases significantly. We observe a similar pattern for NUMA blocked allocation (Figure 5). However, blocked allocation gives better performance because no page migration is required up to a workload size of 40GB: the memory pages are allocated on, and accessed from, the same socket during the computation.
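The locality argument can be checked with a small model. The sketch below assumes that threads are pinned to sockets in contiguous groups (thread 0 to 13 on socket 0, and so on) and that the sizes divide evenly. The article does not state the pinning scheme, so treat this as an assumption, and the function name as hypothetical.

```python
def local_fraction(policy: str, num_pages: int, num_sockets: int, num_threads: int) -> float:
    """Fraction of pages whose memory socket matches the socket of the thread writing them.

    Assumes num_pages is a multiple of num_threads, num_threads is a multiple
    of num_sockets, and threads are pinned to sockets in contiguous groups.
    """
    threads_per_socket = num_threads // num_sockets
    pages_per_socket = num_pages // num_sockets
    pages_per_thread = num_pages // num_threads
    local = 0
    for page in range(num_pages):
        thread = page // pages_per_thread          # contiguous block per thread
        cpu_socket = thread // threads_per_socket  # assumed pinning
        if policy == "interleave":
            mem_socket = page % num_sockets
        elif policy == "blocked":
            mem_socket = page // pages_per_socket
        else:
            raise ValueError(f"unknown policy: {policy}")
        local += cpu_socket == mem_socket
    return local / num_pages


if __name__ == "__main__":
    pages = 56 * 16  # small page count chosen only so the sizes divide evenly
    print(local_fraction("blocked", pages, 4, 56))
    print(local_fraction("interleave", pages, 4, 56))
```

Under these assumptions, every access is local with blocked allocation, while only one page in four is local with interleaved allocation on four sockets, which gives the kernel many more candidate pages to migrate.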
Effect of NUMA Migration on a Single Socket
Figure 6 shows the total time taken with a 160GB workload using different numbers of threads on a single socket, as well as the time spent in user code and kernel code. Since the total memory is equally divided among sockets, each socket has approximately 47GB of memory (187GB divided among four sockets), so we allocated the 160GB across all four sockets. The microbenchmark scales with the number of threads for both allocation policies. Increasing the number of threads decreases execution time, which in turn reduces the number of pages migrated, because the longer an application runs, the more pages the OS kernel migrates.
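One way to collect a user/kernel breakdown like this, together with a migration count, is sketched below. It assumes a Linux system whose /proc/vmstat exposes the numa_pages_migrated counter. That counter is system-wide, so other processes can contribute to it. The sketch also assumes that migration work triggered by the process is accounted to it as system time. The helper names are hypothetical.

```python
import os
import time
from typing import Callable, Dict, Optional


def read_migrated_pages(vmstat_text: str) -> Optional[int]:
    """Extract the numa_pages_migrated counter from /proc/vmstat-style text."""
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "numa_pages_migrated":
            return int(parts[1])
    return None  # counter not present (e.g. kernel built without NUMA balancing)


def snapshot_migrated(path: str = "/proc/vmstat") -> Optional[int]:
    """Read the counter from the running system, or None if unavailable."""
    try:
        with open(path) as f:
            return read_migrated_pages(f.read())
    except OSError:
        return None


def measure(workload: Callable[[], object]) -> Dict[str, Optional[float]]:
    """Run workload and report wall, user, and kernel time plus pages migrated."""
    pages_before = snapshot_migrated()
    cpu_before = os.times()
    wall_before = time.perf_counter()
    workload()
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    pages_after = snapshot_migrated()
    migrated = None
    if pages_before is not None and pages_after is not None:
        migrated = pages_after - pages_before
    return {
        "wall_s": wall,
        "user_s": cpu_after.user - cpu_before.user,
        "kernel_s": cpu_after.system - cpu_before.system,
        "pages_migrated": migrated,
    }


if __name__ == "__main__":
    print(measure(lambda: bytearray(64 * 1024 * 1024)))
```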
The red part of the stacked plots shows the time spent in the kernel code to migrate pages. This is reduced to almost zero when NUMA migration is disabled. The geomean speedup gained by turning off NUMA migrations is 2.4x for interleave and 1.6x for blocked, which shows that NUMA migration has a significant impact on performance.
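For readers unfamiliar with the metric, the geometric mean of a set of speedups is the n-th root of their product. The sketch below shows the calculation. The timings in the example are made-up placeholders chosen for easy arithmetic, not our measurements.

```python
import math
from typing import Iterable, List, Sequence


def speedups(time_with: Sequence[float], time_without: Sequence[float]) -> List[float]:
    """Per-configuration speedup from disabling migration: t_with / t_without."""
    if len(time_with) != len(time_without):
        raise ValueError("timing lists must have the same length")
    return [on / off for on, off in zip(time_with, time_without)]


def geomean(values: Iterable[float]) -> float:
    """Geometric mean, computed in log space to avoid overflow on long lists."""
    vals = list(values)
    if not vals or any(v <= 0 for v in vals):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


if __name__ == "__main__":
    # Placeholder timings (seconds) for three thread counts; NOT measured data.
    with_migration = [8.0, 4.0, 2.0]
    without_migration = [4.0, 2.0, 0.5]
    print(geomean(speedups(with_migration, without_migration)))
```

The geometric mean is the usual choice for averaging ratios because it treats a 2x speedup and a 2x slowdown as canceling out, which an arithmetic mean does not.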
The Effect of NUMA Migration across Multiple Sockets
A pattern similar to the single-socket case (Figure 6) is also observed as we go beyond one socket (Figure 7). Each chart shows the performance with and without NUMA migration as the number of sockets increases. All the cores on the sockets in use are used. Note that the time spent in the kernel is always reduced when NUMA migration is disabled. It is also worth noting that the time spent in user code increases slightly when NUMA migration is disabled, indicating that NUMA migrations do reduce memory access latency. However, the overhead of NUMA migrations can outweigh this benefit and end up hurting overall performance.
From our results, we can conclude that OS-level features such as NUMA migrations must be used with caution because they can have significant performance overhead, especially for single applications running on the entire machine, the most common scenario for high-performance computations.
The effect of NUMA migrations on the runtime of an application depends on various factors such as:
- Type of NUMA allocation policies used
- Number of sockets used on the system
To avoid the performance noise introduced by NUMA page migrations, ensure that such OS-level features are turned off, as shown in Figure 8 (NUMA migrations are on by default).
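On Linux, automatic NUMA balancing is commonly controlled through the kernel.numa_balancing sysctl, which is exposed as the file /proc/sys/kernel/numa_balancing. The sketch below assumes that interface exists on the target kernel. It reads the current setting and, when run with sufficient privileges, writes 0 to turn the feature off. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the setting; absent on kernels built without NUMA balancing.
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def numa_balancing_state(path: Path = NUMA_BALANCING) -> Optional[int]:
    """Return the current setting (0 means disabled), or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def disable_numa_balancing(path: Path = NUMA_BALANCING) -> bool:
    """Try to write 0 to the setting; returns False without the needed privileges."""
    try:
        path.write_text("0\n")
        return True
    except OSError:
        return False


if __name__ == "__main__":
    state = numa_balancing_state()
    if state is None:
        print("NUMA balancing setting not available on this system")
    elif state == 0:
        print("NUMA balancing is already disabled")
    else:
        # Only reports the state; call disable_numa_balancing() as root to change it.
        print(f"NUMA balancing is enabled (value {state})")
```

The same change can be made from a root shell with sysctl kernel.numa_balancing=0. Either way, the setting does not persist across reboots unless it is also added to the system's sysctl configuration.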