Building and Running IO500

This page is a work in progress that combines how-to and personal opinion. At some point I may divorce the two and turn the opinion part into a blog post. Stay tuned.

Compiling

Step 1. Edit the Makefile in the top-level directory and set CC and CFLAGS to match the compiler and build parameters your system requires.

If you don't do this, you will get this error:

mpicc -std=gnu99 -Wall -Wempty-body -Werror -Wstrict-prototypes -Werror=maybe-uninitialized -Warray-bounds -g3 -lefence -I./include/ -I./src/ -I./build/pfind/src/ -I./build/ior/src/ -DVERSION="\"io500-sc20_v3-6-gd25ea80d54c7\"" -c -o verifier.o src/verifier.c
/bin/sh: mpicc: command not found
make: *** [Makefile:59: verifier.o] Error 127

Step 2. Run ./prepare.sh with CC= defined in the environment to match what you put in the Makefile above:

CC=cc ./prepare.sh

If you don't do this, you will get this error:

Building parallel find;
Using LZ4 for optimization
./compile.sh: line 22: mpicc: command not found

If needed, edit ./prepare.sh itself and adjust build_ior and the related build functions.

Running

The benchmark tests whatever file system contains the path given by the datadir config parameter in the ini file passed to it at launch time. By default, this is

[global]
datadir = ./datafiles

which benchmarks a path relative to $PWD. Similarly, it outputs its results to whatever is given as the resultdir path. By default, this is empty in config-minimal.ini and is equivalent to

[global]
resultdir = ./results

Edit the ini file to point to whatever path(s) you want, then do something like

srun -N 4 \
     -n 64 \
     --qos regular \
     -C haswell \
     -t 30:00 \
     ./io500 config-minimal.ini

Alternatively, submit it in batch mode, since it does run for a while (each test runs for five minutes).

To integrate with a batch environment and dynamic mounts (e.g., a burst buffer) you have to do a little bit of gymnastics since io500 only takes its config from a preformatted file.

This is how I run against DataWarp:

#!/usr/bin/env bash
#SBATCH -N 4
#SBATCH -n 64
#SBATCH --qos regular
#SBATCH -C haswell
#SBATCH -t 30:00
#SBATCH -A nstaff
#DW jobdw type=scratch access_mode=striped capacity=20TiB

CONFIG_FILE="$SLURM_SUBMIT_DIR/config-$SLURM_JOBID.ini"

cat <<EOF > "$CONFIG_FILE"
[global]
datadir = $DW_JOB_STRIPED
EOF

srun "$SLURM_SUBMIT_DIR/io500" "$CONFIG_FILE"
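The same config generation can be sketched in Python for cases where the job setup is already scripted that way. This is a minimal sketch, not part of io500 itself: the function name render_io500_config is made up for illustration, and it assumes io500 accepts the plain "key = value" layout that Python's configparser writes, which matches the [global] examples shown earlier.

```python
import configparser
import io
import os


def render_io500_config(datadir, resultdir=None):
    """Return the text of a minimal io500 ini file (hypothetical helper)."""
    # interpolation=None so that '%' characters in paths are taken literally
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"datadir": str(datadir)}
    if resultdir is not None:
        config["global"]["resultdir"] = str(resultdir)
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# DW_JOB_STRIPED is set by the batch system in the script above;
# the fallback path is only a placeholder for trying this outside a job.
datadir = os.environ.get("DW_JOB_STRIPED", "./datafiles")
print(render_io500_config(datadir))
```

Write the returned text to a per-job file (as the shell script does with config-$SLURM_JOBID.ini) and pass that file to io500.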

Notes about the way the io500 binary works:

It automatically scales the benchmark to match the nprocs it's given, and by default its paths are relative to whatever $PWD is. So it inherits a lot from the execution environment.
It uses stonewalling by default, so some of the input parameters may seem ridiculously large. For example IOR easy is configured to write over 9 TB by default.

Results

The results directory contains timestamped output directories, one per run.
This is pretty nice in that running the same io500 repeatedly does not wipe out the results of previous runs.

In each directory are two important summary files:

result_summary.txt - an easy, human-readable file with individual performance measurements and the IO500 score
result.txt - a machine-readable summary of results

The result.txt file is nice, but it labels everything as "score" without units. For reference,

test	score units
ior-easy-write	GiB/s
mdtest-easy-write	kIOPS
ior-hard-write	GiB/s
mdtest-hard-write	kIOPS
find	kIOPS
ior-easy-read	GiB/s
mdtest-easy-stat	kIOPS
ior-hard-read	GiB/s
mdtest-hard-stat	kIOPS
mdtest-easy-delete	kIOPS
mdtest-hard-read	kIOPS
mdtest-hard-delete	kIOPS
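One way to put the units back is a small lookup keyed by test name. The sketch below rests on a loud assumption: that result.txt is ini-formatted with one section per test and a "score" key in each section. Check your own result.txt before relying on it. The sample text is made up and is not real benchmark output.

```python
import configparser

# Units for each test, copied from the table above.
SCORE_UNITS = {
    "ior-easy-write": "GiB/s",
    "mdtest-easy-write": "kIOPS",
    "ior-hard-write": "GiB/s",
    "mdtest-hard-write": "kIOPS",
    "find": "kIOPS",
    "ior-easy-read": "GiB/s",
    "mdtest-easy-stat": "kIOPS",
    "ior-hard-read": "GiB/s",
    "mdtest-hard-stat": "kIOPS",
    "mdtest-easy-delete": "kIOPS",
    "mdtest-hard-read": "kIOPS",
    "mdtest-hard-delete": "kIOPS",
}


def scores_with_units(result_text):
    """Map test name -> (score, units) for every known test in the text."""
    # ASSUMPTION: one ini section per test, each with a 'score' option.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(result_text)
    labeled = {}
    for test, units in SCORE_UNITS.items():
        if parser.has_option(test, "score"):
            labeled[test] = (parser.getfloat(test, "score"), units)
    return labeled


# Made-up sample in the assumed layout.
SAMPLE = """
[ior-easy-write]
score = 2.5
[find]
score = 120.0
"""

for test, (score, units) in scores_with_units(SAMPLE).items():
    print(f"{test}: {score} {units}")
```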

The final IO500 score is calculated by

1. Taking the geometric mean of the GiB/s scores
2. Taking the geometric mean of the kIOPS scores
3. Taking the geometric mean of #1 and #2

This is done rather than taking the geometric mean of all individual scores so that metadata (kIOPS) are weighted equally with bandwidth (GiB/s).
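The calculation described above can be sketched in a few lines of Python. The function names are mine and the input numbers are made up; this follows the description on this page rather than the io500 source code.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def io500_score(bandwidth_scores, iops_scores):
    """Return (bandwidth score, IOPS score, combined score)."""
    bw = geometric_mean(bandwidth_scores)  # GiB/s tests
    md = geometric_mean(iops_scores)       # kIOPS tests
    # The geometric mean of two numbers is the square root of their product.
    return bw, md, math.sqrt(bw * md)


# Made-up inputs: four bandwidth tests and eight IOPS tests.
bw, md, total = io500_score([4.0, 1.0, 4.0, 1.0], [10.0] * 8)
print(f"bandwidth={bw:.3f} GiB/s  iops={md:.3f} kIOPS  score={total:.3f}")
```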

Interpreting Results

Attributing any intellectual value to the final io500 score is unwise; this score metric is a figure of merit that carries a number of biases. Consider the following aspects of the IO500 combined score.

No physical meaning

Taking the geometric mean of GiB/s and kIOPS mixes units of measure in a nonsensical way. The score is expressed in units of "square root of gigabyte-kilo-operations per second" which has no meaning.

Arbitrary equivalence of GiB/s and kIOPS

The IO500 score also draws an arbitrary equivalency between the difficulty of achieving one gigabyte per second and one kilo-I/O operation per second of performance. For example, let's say your IO500 run achieves an overall bandwidth score of 1 GiB/s and overall IOPS score of 1 kIOPS when you first run it. Through heroic effort, you are then able to double the performance of all eight IOPS tests to achieve a score of 2 kIOPS, and your score goes from 1.0 to about 1.4 (the square root of 2). Because of the way the IO500 score works though, you could get that exact same score (about 1.4) by instead doubling the performance of your four bandwidth tests.

Is it a true statement that getting this additional 1 GiB/s of performance would be as easy as getting 1 kIOPS of performance improvement? Put into more real terms, is it easier to buy a file system that can deliver 1 TiB/s of bandwidth or 1 MIOPS? The former requires hundreds of SSDs and dozens of servers, but the latter can be achieved with dozens of SSDs and a single server. And the IO500 score treats them equivalently.
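The symmetry in the doubling example is easy to check numerically. The sketch below (function name is mine) also shows a consequence of the two-level geometric mean: with four bandwidth tests and eight IOPS tests, each individual bandwidth test enters the final score with exponent 1/8, while each individual IOPS test enters with exponent 1/16.

```python
import math


def combined_score(bandwidth_scores, iops_scores):
    """Geometric mean of the two per-category geometric means."""
    bw = math.prod(bandwidth_scores) ** (1 / len(bandwidth_scores))
    md = math.prod(iops_scores) ** (1 / len(iops_scores))
    return math.sqrt(bw * md)


baseline = combined_score([1.0] * 4, [1.0] * 8)

# Doubling every test in one category has the same effect either way.
double_all_iops = combined_score([1.0] * 4, [2.0] * 8)
double_all_bw = combined_score([2.0] * 4, [1.0] * 8)

# Doubling a single test: a bandwidth test moves the score more than an IOPS test.
double_one_bw = combined_score([2.0, 1.0, 1.0, 1.0], [1.0] * 8)
double_one_iops = combined_score([1.0] * 4, [2.0] + [1.0] * 7)

print(f"baseline:            {baseline:.3f}")
print(f"double all 8 IOPS:   {double_all_iops:.3f}")
print(f"double all 4 BW:     {double_all_bw:.3f}")
print(f"double one BW test:  {double_one_bw:.3f}")
print(f"double one IOPS test:{double_one_iops:.3f}")
```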

This becomes further complicated when you consider the client requirements to achieve 1 TiB/s versus 1 million IOPS. Adding more clients typically improves streaming bandwidth to independent files (ior-easy) but increases lock contention to shared files and directories (ior-hard and mdtest-hard). As a result, getting a high IO500 score requires finding the perfect balance of clients to servers that has enough parallelism to get decent bandwidth for ior-easy while not causing too much contention and dropping the IOPS tests' performance.

A kIOPS is weighted the same as a GiB/s, yet modern flash devices deliver hundreds of kIOPS but only single-digit GiB/s. As a result, maximizing your IO500 score usually means sacrificing your bandwidth score to pump up your IOPS scores, since they give you more mileage more easily.

Arbitrary classification of tests as bandwidth or IOPS

I am still mentally working through the ior-hard test and its choice of units. Don't quote me on any of the following.

One final aspect of the IO500 scoring scheme to consider is that it arbitrarily defines the ior-hard tests (performing 47 KB I/Os) as bandwidth tests instead of IOPS tests. At this transfer size, many file systems will be IOPS-limited, not bandwidth-limited. Because the result is scored in GiB/s, where an IOPS-limited file system looks poor, ior-hard arbitrarily penalizes file systems that treat the intended pattern (unaligned-but-consistent data accesses) as an IOPS-bound problem (e.g., using locks) relative to those that treat it as a bandwidth problem (e.g., using log-structured writes).

This may be valid, since dumping a fixed dataset to storage as quickly as possible (as this test does) is fundamentally a bandwidth problem. However, there are other aspects of log-structured file systems (such as asynchronous compaction activity) that IO500 does not test, so it does not penalize such file systems in the same way it penalizes locking file systems.
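To get a feel for how an ior-hard result would look if it were reported as IOPS, the conversion is simple arithmetic. This sketch assumes a transfer size of 47008 bytes, which I believe is the exact value behind the "47 KB" above; treat it as an assumption and substitute whatever transfer size your run actually used.

```python
GIB = 2 ** 30  # bytes in one GiB

# ASSUMPTION: exact ior-hard transfer size; the text above only says "47 KB".
IOR_HARD_TRANSFER_BYTES = 47008


def bandwidth_to_kiops(gib_per_s, transfer_bytes=IOR_HARD_TRANSFER_BYTES):
    """Thousands of I/O operations per second implied by a bandwidth."""
    return gib_per_s * GIB / transfer_bytes / 1000.0


def kiops_to_bandwidth(kiops, transfer_bytes=IOR_HARD_TRANSFER_BYTES):
    """GiB/s implied by an operation rate at a fixed transfer size."""
    return kiops * 1000.0 * transfer_bytes / GIB


for gib_per_s in (0.1, 1.0, 10.0):
    print(f"{gib_per_s:5.1f} GiB/s -> {bandwidth_to_kiops(gib_per_s):8.1f} kIOPS")
```

Seen this way, a modest-looking GiB/s figure for ior-hard corresponds to tens of thousands of small operations per second.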

Significance of scores will change over time

Because IO500 is a multidimensional score that arbitrarily equates GiB/s to kIOPS, the meaning of this aggregate score will change over time as the relative difficulty of getting more GiB/s instead of kIOPS from different hardware and file system technologies changes.

This is notably different than something like Top500, whose score has a fixed meaning that is not dependent on the relative difficulty of achieving one dimension of performance over another. 100 FLOPS is 100 FLOPS regardless of whether it's 1990 or 2020, but the same cannot be said for 20 sqrt(GiB * kIOP)/sec. Technologies like persistent memory make it much easier to achieve huge IOPS while not really improving bandwidth at all; thus, it's far easier to post a staggeringly high IO500 score by achieving extremely high IOPS (using persistent memory) than it is to do the same by achieving extremely high bandwidth.

Inconsistent Motivation

IO500's workloads fail to acknowledge that there are two motivations for benchmarking: understanding system-level performance and understanding application performance. Synthetic benchmarks and workloads do fine for the former, but application-derived workloads are important for the latter.

IO500's ior-easy and mdtest benchmarks are tests of system capability in that they demonstrate the peak capability of a system using idealized patterns that real applications strive to generate. On the other hand, the ior-hard benchmark tests an arbitrary pattern that is neither representative of system capability nor any specific user application.

At the same time, the IO500 benchmark lacks the standard 4K random read/write patterns on the grounds that those patterns do not represent any real workload. This is a poorly formed argument, as the other tests (with the exception of the find test) do not represent particular workloads either. Because IO500 does not focus on either testing specific applications or system-level performance exclusively, the tests (and therefore the scores) are equally unfocused and arbitrary.

Corollary on Interpreting Scores

Don't put a lot of weight on the combined IO500 score of any particular storage system, because it has a few subtle biases in it that do not apply to all computing environments or workloads. Realize that the IO500 list, when sorted by aggregate score, reflects not only hardware and software capability, but a degree of willingness to sacrifice bandwidth scores to pump up IOPS scores. It is gamed to the degree that the top tests aren't necessarily run using client counts, job geometries, and overall configurations that reflect reality.

Instead, look at the scores individually to determine which scores matter most to you and the workloads you're having to handle. The IO500 list is extremely valuable in its transparency and multidimensionality, and its maintainers have created a lot of nice tools to make it easy for you to sort by the metrics that matter to you. Make use of this, and be skeptical of anyone who boasts of their IO500 position as evidence that they have a good file system.