Thursday, July 19, 2012

Performance Considerations on NUMA Systems

UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access) are shared memory architectures used in parallel computers.

In a UMA architecture, all processors share the physical memory uniformly. In this model, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

In contrast, a system with a NUMA architecture consists of multiple "nodes" of CPUs (processors) and/or memory which are linked together by a special high-speed interconnect. All CPUs have access to all of the memory, but a CPU's access to memory on its local or a nearby node is faster than access to memory on distant nodes. For more explanation, see [1-3].

In this article, we will examine NUMA support in Linux kernels and Application Servers.

Memory Policy

Most modern OS kernels are NUMA aware and support different memory policies.

NUMA policy is concerned with placing memory allocations on specific nodes so that programs can access them as fast as possible. The primary way to do this is to allocate memory for a thread on its local node and keep the thread running there (node affinity). This gives the best memory latency and minimizes traffic over the global interconnect.

There are four different memory policies supported in Linux:
  1. interleave
    • Interleave memory allocations on a set of nodes.
  2. preferred
    • Try to allocate on a given node first, falling back to other nodes if necessary.
  3. membind
    • Allocate on a specific set of nodes.
  4. localalloc (default)
    • Allocate on the local node.
The difference between membind and preferred is that membind fails the allocation when the memory cannot be allocated on the specified nodes, whereas preferred falls back to other nodes. Using membind can therefore cause a program to run out of memory earlier and to suffer delays due to increased swapping.
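
Each of these policies can be requested on the command line with numactl when launching a program. As a quick sketch (where myprog is just a placeholder for your own executable):

  $ numactl --interleave=all myprog   # interleave allocations across all nodes
  $ numactl --preferred=1 myprog      # try node 1 first, fall back to other nodes
  $ numactl --membind=0 myprog        # allocate on node 0 only; fail if it is exhausted
  $ numactl --localalloc myprog       # the default: allocate on the local node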

To find out how many NUMA nodes exist on your Linux system, try:


$ numactl --hardware
available: 2 nodes (0-1)
node 0 size: 48448 MB
node 0 free: 212 MB
node 1 size: 48480 MB
node 1 free: 13148 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10



$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32284 MB
node 0 free: 12942 MB
node distances:
node   0
  0:  10


The above outputs are from two different Linux systems. The first system has two memory nodes (i.e., it is a NUMA system); the second has only one memory node (i.e., it is not a NUMA system). The node distances at the end of each output specify the cost of reaching node M from node N (here 10 units for local access and 20 units for remote access). On a NUMA machine you should see an NxN matrix of access costs from each node to every other node.
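
If you want to check whether allocations are actually being satisfied locally, the numastat utility (typically shipped alongside numactl) prints per-node counters such as numa_hit, numa_miss, numa_foreign, local_node and other_node; a steadily growing numa_miss or other_node count is a hint that memory is being taken from remote nodes:

$ numastat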

Similarly, you can find out what the current memory policy, CPU binding, and memory binding are by typing:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1


$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0
nodebind: 0
membind: 0

Note that the outputs are from the same two systems, in the same order. The first system has 24 CPUs; the second has 8 CPUs.

Running Applications with a Specific NUMA Policy

The numactl program allows you to run your application on specific CPUs and memory nodes. It does this by supplying a NUMA memory policy to the operating system before running your program. For example,
   $ numactl --cpunodebind=0 --membind=0,1 myprog
runs the program "myprog" on the CPUs of node 0, using memory from nodes 0 and 1.
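
To double-check that the policy really took effect, you can make numactl --show itself the target command; the child process inherits the policy set by the outer numactl, so the report should now list only the restricted CPU and memory bindings:
   $ numactl --cpunodebind=0 --membind=0,1 numactl --show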

NUMA Enhancements in Different VMs


As described in [3], certain applications do not scale easily on Non-Uniform Memory Access (NUMA) systems because adding CPU cores does not proportionately increase application performance.

To perform well on a NUMA architecture, the JVM's allocator and garbage collector threads should be structured to take advantage of node locality. Below we examine the NUMA enhancements provided in HotSpot and JRockit.

HotSpot


Since JDK 6 update 2, the NUMA-aware allocator can be turned on with the -XX:+UseNUMA flag in conjunction with the throughput collector (which is the default collector in HotSpot, or can be set explicitly with the -XX:+UseParallelGC flag) [4]. When this feature is on, the JVM divides the young generation into separate pools, one pool per node. When a thread allocates an object, the JVM checks which node that thread is running on and allocates the object from that node's pool. By taking advantage of node locality and affinity, applications running on NUMA systems can scale better.
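
As a rough sketch, a NUMA-aware HotSpot launch might look like the following, where MyApp and the heap sizes are placeholders rather than part of the original example:
   $ java -XX:+UseParallelGC -XX:+UseNUMA -Xms4g -Xmx4g MyApp
On recent JDKs you can confirm the resulting flag settings with java -XX:+PrintFlagsFinal -version | grep -i numa.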

JRockit [5]


The flag -XX:BindToCPUs can be used to force JRockit to use only certain CPUs in the machine.

For example, you can set JRockit CPU affinity to CPUs 0 and 2 only:
  • java -XX:BindToCPUs=0,2
For NUMA, a separate affinity flag exists for NUMA nodes (-XX:BindToNumaNodes), as well as a flag that controls the NUMA memory allocation policy. This lets the user specify whether JRockit should interleave allocated pages evenly across all NUMA nodes or bind them to the local node where the memory allocation takes place. A "preferred local" policy can also be selected, meaning that JRockit should try to use the local node but interleaved allocation is also acceptable.

For example, you can force local NUMA allocation:
  • java -XX:NumaMemoryPolicy=strictlocal
Other accepted values are preferredlocal and interleave.
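
Putting the JRockit flags together, a hypothetical launch that binds the JVM to NUMA nodes 0 and 1 and interleaves its allocations across them might look like this (MyApp is again a placeholder):
  • java -XX:BindToNumaNodes=0,1 -XX:NumaMemoryPolicy=interleave MyApp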

Acknowledgement


Some of the material here is based on feedback from Igor Veresov, Jon Masamitsu, and David Keenan. However, the author assumes full responsibility for the content.
