Cori

The Cori acceptance tests for both the Lustre file system (cscratch) and the DataWarp burst buffer used IOR to obtain the advertised peak I/O numbers.

Lustre Phase I

For the peak Lustre performance (Phase I), the following runs were used:

./IOR -w -a POSIX -F -C -e -g -k -b 4m -t 4m -s 1638 -o $SCRATCH/IOR_file -v
  • 960 nodes, 4 processes per node
  • 4 MiB transfer and block size
  • 24 TiB total write size (which determined the segment count)
  • -w and -r were run as separate sruns (to ensure cache was dropped)
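Running -w and -r as separate job steps keeps the read phase from being satisfied out of freshly written, cached data. A minimal sketch of such a submission, assuming a Slurm batch script (the script itself is illustrative, not the acceptance script):

#!/bin/bash
#SBATCH -N 960
# Write phase: file-per-process, 4 MiB transfers; -k keeps the files on disk for the read phase
srun -N 960 --ntasks-per-node=4 ./IOR -w -a POSIX -F -C -e -g -k -b 4m -t 4m -s 1638 -o $SCRATCH/IOR_file -v
# Read phase as a separate srun; -C reorders tasks so each rank reads blocks written from another node
srun -N 960 --ntasks-per-node=4 ./IOR -r -a POSIX -F -C -e -g -k -b 4m -t 4m -s 1638 -o $SCRATCH/IOR_file -v

The MPI-IO shared-file run below used 32 processes per node, again with -w and -r run separately: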
./IOR -w -a MPIIO -c -C -g -b 8m -t 8m -k -H -v -s $((12*1024*1024/8/(960*32))) -o $SCRATCH/IOR_file
  • 960 nodes, 32 processes per node
  • 8 MiB transfers and block size
  • 12 TiB total write size
  • -w and -r run separately
  • collective buffering explicitly disabled
  • stripe size set to 8 MiB
  • stripe count set to the total number of OSTs, from lfs df $SCRATCH
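The striping and hint settings above are applied outside IOR itself. A sketch of one common way to do this (the directory name is illustrative, and the exact acceptance procedure is not recorded here beyond the bullets):

# Stripe count = total number of OSTs reported by lfs df, stripe size = 8 MiB
NOST=$(lfs df $SCRATCH | grep -c OST)
lfs setstripe -c $NOST -S 8m $SCRATCH/ior_dir
# On Cray MPICH, collective buffering can be disabled through ROMIO hints
# even when IOR issues collective MPI-IO calls (-c)
export MPICH_MPIIO_HINTS="*:romio_cb_write=disable:romio_cb_read=disable"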

To summarize:

  • 716,886.15 MiB/sec (POSIX file-per-process write)
  • 646,835.37 MiB/sec (POSIX file-per-process read)
  • 344,016.32 MiB/sec (MPI-IO shared-file write)
  • 614,328.95 MiB/sec (MPI-IO shared-file read)

Lustre Phase II

The same tests were performed for the Phase II acceptance, but Lustre performance was lower because the OSTs had been filling up. To summarize, 960 nodes gave:

  • 562,701.01 MiB/sec (POSIX file-per-process write, KNL with 8 ppn, buffered I/O)
  • 389,576.59 MiB/sec (POSIX file-per-process read, KNL with 8 ppn, buffered I/O)
  • 624,666.33 MiB/sec (POSIX file-per-process write, KNL with 8 ppn, direct I/O)
  • 397,262.65 MiB/sec (POSIX file-per-process read, KNL with 8 ppn, direct I/O)
  • 478,170.92 MiB/sec (POSIX file-per-process write, Haswell with 4 ppn) - 66% of the Phase I acceptance
  • 346,969.69 MiB/sec (POSIX file-per-process read, Haswell with 4 ppn) - 53% of the Phase I acceptance
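The direct I/O rows above bypass the compute-node page cache. The exact Phase II command lines are not reproduced here; in IOR, O_DIRECT for the POSIX interface is requested with -B, so a direct-I/O variant of the Phase I POSIX command would look roughly like the following (8 processes per node as in the rows above; segment count omitted):

# -B opens files with O_DIRECT so writes and reads bypass the page cache
srun -N 960 --ntasks-per-node=8 ./IOR -w -a POSIX -B -F -C -e -g -k -b 4m -t 4m -o $SCRATCH/IOR_file -v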

There were no peak numbers for MPI-IO shared-file I/O for Phase II.
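Since the lower Phase II rates were attributed to OST fullness, per-OST usage can be checked directly, for example:

lfs df -h $SCRATCH   # lists capacity and usage for each OST backing the file system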

DataWarp Phase I

DataWarp Phase I used 4480 processes (ppn=4) with the following IOR command-line options:

  • ./IOR -a MPIIO -g -t 512k -b 8g -o $DW_JOB_STRIPED/IOR_file -v
  • ./IOR -a POSIX -F -e -g -t 512k -b 8g -o $DW_JOB_STRIPED/IOR_file -v
  • ./IOR -a POSIX -F -e -g -t 4k -b 1g -o $DW_JOB_STRIPED/IOR_file -v -z
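The $DW_JOB_STRIPED path is created by the job's burst-buffer allocation (4480 processes at ppn=4 corresponds to 1120 nodes). A minimal sketch of the enclosing batch script, assuming Slurm's DataWarp #DW directive; the requested capacity is a placeholder sized to hold the 8 GiB-per-process files, not the acceptance value:

#!/bin/bash
#SBATCH -N 1120
#DW jobdw capacity=40000GiB access_mode=striped type=scratch
# $DW_JOB_STRIPED is set by the DataWarp allocation requested above
srun -N 1120 --ntasks-per-node=4 ./IOR -a POSIX -F -e -g -t 512k -b 8g -o $DW_JOB_STRIPED/IOR_file -v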

To summarize:

  • 832,451.89 MiB/sec (POSIX file-per-process write)
  • 862,616.35 MiB/sec (POSIX file-per-process read)
  • 334,627.84 MiB/sec (MPI-IO shared-file write)
  • 765,847.30 MiB/sec (MPI-IO shared-file read)
  • 12,527,427.06 IOP/sec (POSIX file-per-process write)
  • 12,591,977.74 IOP/sec (POSIX file-per-process read)

DataWarp Phase II

The Phase II IOR runs used between 44,000 and 44,080 processes (again, ppn=4) with the following IOR command-line options:

  • ./IOR -a POSIX -F -e -g -t 1M -b 8G -o $DW_JOB_STRIPED/IOR_file -v
  • ./IOR -a MPIIO -g -t 1M -b 8G -o $DW_JOB_STRIPED/IOR_file -v
  • ./IOR -a POSIX -F -e -g -t 4k -b 1g -o $DW_JOB_STRIPED/IOR_file -v -z
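For a sense of scale, the aggregate data written by a file-per-process pass follows from the process count and block size; using the upper bound above:

echo "$((44080 * 8)) GiB"   # 352,640 GiB, i.e. roughly 344 TiB per write (and per read) pass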

To summarize, the peak numbers were

  • 1,493,373.74 MiB/sec (POSIX file-per-process write)
  • 1,663,914.47 MiB/sec (POSIX file-per-process read)
  • 1,300,578.87 MiB/sec (MPI-IO shared-file write; independent I/O)
  • 1,259,295.00 MiB/sec (MPI-IO shared-file read; independent I/O)
  • 13,135,292.56 IOP/sec (POSIX file-per-process write)
  • 28,260,132.42 IOP/sec (POSIX file-per-process read)

Community File System

Phase I

The CFS Phase I file system was composed of seven IBM ESS GL8c appliances. Access between clients and servers was via FDR InfiniBand.

The IOPS test was run using

mpirun -n 184 --map-by node ./ior -w -r -z -e -C -F -t 4k -b 1g -o /global/cfs/iorfpprandfile
  • 23 nodes, 8 processes per node
  • 4 KiB transfers
  • 1 GiB blocks
  • 184 GiB total output size
  • random offsets, file-per-process access

The peak results were

  • 551,934.28 write operations/sec
  • 423,413.67 read operations/sec
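Assuming each operation corresponds to one 4 KiB transfer, the peak write rate above is equivalent to only about 2.1 GiB/sec of bandwidth, far below the large-transfer rates reported next:

echo "$((551934 * 4 / 1024)) MiB/s"   # ~2,155 MiB/sec equivalent at 4 KiB per operation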

The full performance test was run using

mpirun -n 408 --map-by node ./ior -w -r -e -C -F -t 1024k -b 32g -o /global/cfs/iorfppseqfile1MiB
  • 51 nodes, 8 processes per node
  • 1 MiB transfers
  • 32 GiB blocks
  • 12.75 TiB total output size
  • sequential, file-per-process access

The peak results were

  • 184,743.00 MiB/sec (POSIX file-per-process write)
  • 155,214.17 MiB/sec (POSIX file-per-process read)