1. Introduction
Although a compute node or workstation may appear to have 16 cores and 64 GB of DRAM, these resources are not uniformly accessible to your applications. The best application performance is usually obtained by keeping your code's parallel workers (e.g., threads or MPI processes) as close to the memory on which they are operating as possible. While you might like to think that the Linux thread scheduler would do this automatically for you, the reality is that most HPC applications benefit greatly from a little bit of help in manually placing threads on different processor cores.
To get an idea of what your multithreaded application is doing while it is running, you can use the ps command. Assuming your executable is called application.x, you can easily see which core each thread is using by issuing the following command in bash:
$ for i in $(pgrep application.x); do ps -mo pid,tid,fname,user,psr -p $i; done
The PSR field is the OS identifier for the core each TID (thread id) is utilizing.
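For example, for a four-thread run of application.x, the output might look something like this (the PIDs, TIDs, and core numbers here are purely illustrative):
$ for i in $(pgrep application.x); do ps -mo pid,tid,fname,user,psr -p $i; done
  PID   TID COMMAND  USER     PSR
 4444     - applicat glock      -
    -  4444 -        glock      0
    -  4445 -        glock      1
    -  4446 -        glock      2
    -  4447 -        glock      3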
2. Types of Thread Scheduling
Certain types of unevenly loaded applications can experience serious performance degradation caused by the Linux scheduler treating high-performance application codes in the same way it would treat a system daemon that might spend most of its time idle.
These sorts of scheduling issues are best described with diagrams. Let's assume we have compute nodes with two processor sockets, and each processor has four cores.
When you run a multithreaded application with four threads (or even four serial applications), Linux will schedule those threads for execution by assigning each one to a CPU core. Without being explicitly told how to do this scheduling, Linux may decide to
- run thread0 to thread3 on core0 to core3 on socket0
- run thread0 and thread1 on core0 and core1 on socket0, and run thread2 and thread3 on socket1
- run thread0 and thread1 on core0 only, run thread2 on core1, run thread3 on core2, and leave core3 completely unutilized
- any number of other nonsensical allocations involving assigning multiple threads to a single core while other cores sit idle
It should be obvious that options #3 and #4 are very bad for performance, but the fact is that Linux will happily schedule your multithreaded job (or multiple single-threaded jobs) this way if your threads behave in a way that confuses the operating system.
2.1. Compact Scheduling
Option #1 is often referred to as "compact" scheduling and is depicted in the diagram to the right. It keeps all of your threads running on a single physical processor if possible, and this is what you would want if all of the threads in your application need to repeatedly access different parts of a large array. This is because all of the cores on the same physical processor can access the memory banks associated with (or "owned by") that processor at the same speed. However, cores cannot access memory stored on memory banks owned by a different processor as quickly; this phenomenon is called NUMA (non-uniform memory access). If your threads all need to access data stored in the memory owned by one processor, it is often best to put all of your threads on the processor that owns that memory.
2.2. Round-Robin Scheduling
Option #2 is called "scatter" or "round-robin" scheduling and is ideal if your threads are largely independent of each other and don't need to access a lot of memory that other threads need. The benefit to round-robin thread scheduling is that not all threads have to share the same memory channel and cache, effectively doubling the memory bandwidth and cache sizes available to your application. The tradeoff is that memory latency becomes higher as threads have to start accessing memory that might be owned by another processor.
2.3. Stupid Scheduling
Options #3 and #4 are what I call "stupid" scheduling (see diagram to the right) and can often be the default behavior of the Linux thread scheduler if you don't tell Linux where your threads should run. This happens because in traditional Linux server environments, most of the processes that are running at any given time aren't doing anything. To conserve power, Linux will put a lot of these quiet processes on the same processor or cores, then move them to their own dedicated core when they wake up and have to start processing.
If your application is running at full bore 100% of the time, Linux will probably keep it on its own dedicated CPU core. However, if your application has an uneven load (e.g., most threads sit idle while the last thread finishes), Linux will see that the application is mostly quiet and pack all of the quiet threads (e.g., t0 and t1 in the diagram to the right) onto the same CPU core. This wouldn't be so bad, except that moving a thread from one core to another requires a context switch, which gets very expensive when it happens hundreds or thousands of times a minute.
3. Defining Affinity
There are several ways to specify how you want your threads to be bound to cores.
- If your application uses pthreads directly, you will have to use the "Linux-portable" methods (taskset or numactl) described below.
- If your application uses OpenMP, you can use the OpenMP runtime controls which are generally a lot nicer and more powerful.
3.1. The Linux-Portable Way (taskset)
If you want to launch a job (e.g., simulation.x) on a certain set of cores (e.g., core0, core2, core4, and core6), issue
$ taskset -c 0,2,4,6 simulation.x
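You can verify that the binding took effect by querying the running process with taskset's -p flag. Assuming simulation.x is the only matching process on the node, something like the following will print its current affinity list (the PID in the output is just illustrative):
$ taskset -cp $(pgrep simulation.x)
pid 12345's current affinity list: 0,2,4,6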
If your process is already running, you can also use taskset to define thread affinity in flight. In addition, it lets you bind specific TIDs to specific processors at a finer level of granularity than specifying -c 0,2,4,6, which matters because with only a core list, Linux may still schedule two threads on core2 and nothing on core0. For example,
$ for i in $(pgrep application.x); do ps -mo pid,tid,fname,user,psr -p $i; done
  PID   TID COMMAND  USER     PSR
21654     - applicat glock      -
    - 21654 -        glock      0
    - 21655 -        glock      2
    - 21656 -        glock      2
    - 21657 -        glock      6
    - 21658 -        glock      4
$ taskset -p -c 0 21654
$ taskset -p -c 0 21655
$ taskset -p -c 2 21656
$ taskset -p -c 4 21657
$ taskset -p -c 6 21658
This sort of double-booking (two threads sharing core2 while core0 sits idle) does happen under certain conditions, so handing a set of CPUs to a set of threads without explicitly assigning each thread to its own physical core may not always behave optimally.
3.2. The Other Linux-Portable Way (numactl)
The emerging standard for easily binding processes to processors on Linux-based supercomputers is numactl. It can operate on a coarser-grained basis (i.e., CPU sockets rather than individual CPU cores) than taskset, which deals only in CPU cores, because it is aware of the processor topology and how the CPU cores map to CPU sockets. Using numactl is typically easier; after all, the common goal is to confine a process to a NUMA pool (or "cpu node") rather than to specific CPU cores. To that end, numactl also lets you bind a process's memory locality to prevent processes from having to jump across NUMA pools (called "memory nodes" in numactl parlance).
Whereas binding a specific process to one processor socket with taskset requires you to enumerate that socket's cores explicitly, e.g.,
$ taskset -c 0,2,4,6 simulation.x
the same operation is greatly simplified with numactl:
$ numactl --cpunodebind=0 simulation.x
If you want to also restrict simulation.x's memory use to the NUMA pool associated with cpu node 0, you can do
$ numactl --cpunodebind=0 --membind=0 simulation.x
or just
$ numactl -N 0 -m 0 simulation.x
You can see what cpu nodes and their corresponding memory nodes are available on your system by using numactl -H:
$ numactl -H
available: 2 nodes (0-1)
node 0 size: 32728 MB
node 0 free: 12519 MB
node 1 size: 32768 MB
node 1 free: 16180 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
numactl also lets you supply specific cores (like taskset) with the --physcpubind or -C option. Unlike taskset, though, numactl does not appear to let you change the CPU affinity of a process that is already running.
An alternative syntax to numactl -C is something like
$ numactl -C +0,1,2,3 simulation.x
By prefixing your list of cores with a +, you can have numactl bind to relative cores. When combined with cpusets (which are enabled by default for all jobs on Gordon), the above command will use the 0th, 1st, 2nd, and 3rd core of the job's given cpuset instead of literally core 0,1,2,3.
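If you are not sure which cores your job's cpuset actually grants, you can check from inside the job: numactl --show prints the CPUs and memory nodes the current process is allowed to use, and the kernel exposes the same CPU list in /proc/self/status. The relative list you pass with + is interpreted against that set.
$ numactl --show
$ grep Cpus_allowed_list /proc/self/status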
3.3. Using Non-Standard OpenMP Runtime Extensions
Multithreaded programs compiled with Intel Compilers can utilize Intel's Thread Affinity Interface for OpenMP applications. Set and export the KMP_AFFINITY env variable to express binding preferences. KMP_AFFINITY has three principal binding strategies:
- compact fills up one socket with threads before allocating threads to other sockets
- scatter evenly spreads threads across all sockets and cores
- explicit allows you to define exactly which cores/sockets to use
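For example, to ask for round-robin placement and have the Intel runtime report the resulting bindings when the program starts (the verbose modifier only adds diagnostic output), you could set:
export KMP_AFFINITY='verbose,scatter'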
Using KMP_AFFINITY=compact will preferentially bind all your threads, one per core, to a single socket before it tries binding them to other sockets. Unfortunately, it will start at socket0 regardless of whether other processes (such as another SMP job) are already bound to that socket. You can explicitly specify an offset to force the job to bind to a specific socket, but you need to know exactly what is running on which cores and sockets on your node in order to specify this in your submit script.
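As a rough illustration only (the permute and offset fields are interpreted relative to the machine's topology map, so verify the result with the verbose modifier): on the two-socket, eight-core example node above, an offset of 4 should skip socket0's four cores and pack the threads onto socket1:
export KMP_AFFINITY='verbose,granularity=core,compact,0,4'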
You can also explicitly define which cores your job should use, which works best when combined with a little knowledge of your system's CPU topology (Intel's Processor Topology Enumeration tool is great for this). If you wanted to run on cores 0, 2, 4, and 6, you would do
export KMP_AFFINITY='proclist=[0,2,4,6],explicit'
GNU's implementation of OpenMP has an environment variable similar to KMP_AFFINITY called GOMP_CPU_AFFINITY. Incidentally, Intel's OpenMP runtime also supports GOMP_CPU_AFFINITY, so using this variable may be a relatively portable way to specify thread affinity at runtime. The equivalent GOMP_CPU_AFFINITY for the KMP_AFFINITY I gave above would be:
export GOMP_CPU_AFFINITY='0,2,4,6'
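Because both variables are simply read from the environment when the program starts, you can also set them for a single run. A complete invocation of the earlier example binary might look like this (four threads pinned to the four listed cores):
$ OMP_NUM_THREADS=4 GOMP_CPU_AFFINITY='0,2,4,6' simulation.x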
3.4. Using OpenMP 4.0 Runtime Controls
Because KMP_AFFINITY and GOMP_CPU_AFFINITY turn out to be essential for good performance in OpenMP applications, the OpenMP 4.0 standard introduced a few environment variables to accomplish the same effect in a portable, standardized way:
- OMP_PROC_BIND allows you to specify that you want to prevent threads from migrating between cores and, optionally, the binding strategy you wish to use
- OMP_PLACES allows you to specify a more explicit binding strategy
Specifically, OMP_PROC_BIND can be set to one of the following values:
- true - don't prescribe a specific binding strategy, but when a thread is launched on a core, don't ever let it migrate to another core
- spread - evenly spread threads across all sockets and cores. This is the same as KMP_AFFINITY=scatter above.
- close - fill up one socket with threads before allocating threads to other sockets. This is the same as KMP_AFFINITY=compact above.
The OMP_PLACES environment variable allows you to specify groups of cores on which you'd like your threads to be bound. This allows you to run an OpenMP application on only a subset of the processor cores available. For example,
- OMP_PLACES='sockets(1)' - only allow the application to run on the cores provided by a single CPU socket
- OMP_PLACES='cores(4)' - only allow the application to run on the hardware threads provided by four CPU cores. If each CPU core has multiple hardware threads (e.g., Intel HyperThreading), the application can still use all of the hyperthreads on all of the cores.
- OMP_PLACES='threads(16)' - only allow the application to run on sixteen hardware threads
You can also use OMP_PLACES to achieve the same effect as KMP_AFFINITY=explicit. For example,
- OMP_PLACES='{0},{2},{4},{6}' will allow OpenMP threads to run on cores 0, 2, 4, and 6.
- OMP_PLACES='{0,1},{2,3},{4,5},{6,7}' will allow OpenMP threads to bind to four distinct places, where each place consists of two cores. OpenMP threads will then be able to migrate between the two cores in each place, but not between places.
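In practice, OMP_PROC_BIND and OMP_PLACES are usually set together. For example, to pin one thread to each of the eight cores of the example node used throughout this page, spreading them across both sockets, you might use:
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=cores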
There are more options and details than what I've listed, and NERSC has a good explanation of how to use OpenMP 4.0 Thread Affinity Controls. However, the easiest way to get a handle on exactly what these options do on your nodes is to experiment with them. A very simple OpenMP program that reports which core each thread lands on is very helpful to this end; a minimal sketch of how to build and run one (the file name here is just a placeholder, and it assumes gcc and glibc's sched_getcpu()) follows:
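cat > show-affinity.c <<'EOF'
/* Minimal sketch: each OpenMP thread reports the core it is currently running on. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() is glibc-specific; it returns the core this thread is on right now */
        printf("thread %2d of %2d is running on core %2d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
EOF
gcc -fopenmp -o show-affinity show-affinity.c
OMP_PROC_BIND=spread OMP_PLACES=cores ./show-affinity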
Try running it with different values of OMP_PROC_BIND and OMP_PLACES.
3.5. getfreesocket
I wrote a small perl script called getfreesocket that uses KMP_AFFINITY=explicit (or GOMP_CPU_AFFINITY) and some probing of the Linux OS at runtime to bind SMP jobs to free processor sockets. It should be invoked in a job script something like this:
#!/bin/bash
# Number of processor sockets this job should occupy
NPROCS=1
BINARY=${HOME}/bin/whatever
# Count the physical sockets and logical cores present on this node
nprocs=$(grep '^physical id' /proc/cpuinfo | sort -u | wc -l)
ncores=$(grep '^processor' /proc/cpuinfo | sort -u | wc -l)
coresperproc=$((ncores/nprocs))
# Run one thread per core on each socket we claim
OMP_NUM_THREADS=$((NPROCS*coresperproc))
# Ask getfreesocket for an explicit core list spanning the requested number of free sockets
freesock=$(getfreesocket -explicit=${NPROCS})
if [ "z$freesock" == "z" ]
then
    echo "Not enough free processors! aborting"
    exit 1
else
    # Intel OpenMP takes the comma-separated proclist; GOMP gets the same list, space-separated
    KMP_AFFINITY="granularity=fine,proclist=[$freesock],explicit"
    GOMP_CPU_AFFINITY="$(echo $freesock | sed -e 's/,/ /g')"
fi
export KMP_AFFINITY OMP_NUM_THREADS GOMP_CPU_AFFINITY
${BINARY}
This was a very simple solution to get single-socket jobs to play nicely on the shared batch system we were using at the Interfacial Molecular Science Laboratory. While numactl is an easier way to accomplish some of this, it still requires that you know what other processes are sharing your node and on what CPU cores they are running. I've experienced problems with Linux's braindead thread scheduling, so this getfreesocket finds completely unused sockets that can be fed into taskset, KMP_AFFINITY, or numactl.
This is not as great an issue if your resource manager supports launching jobs within cpusets. Your resource manager will provide a cpuset, and using relative specifiers for numactl cores (e.g., numactl -C +0-3) will bind to the free socket provided by the batch environment. Of course, this will not specifically bind one thread to one core, so using KMP_AFFINITY or GOMP_CPU_AFFINITY may remain necessary.