This page is a work in progress that combines how-to and personal opinion. At some point I may divorce the two and turn the opinion part into a blog post. Stay tuned.
Step 1. Edit the Makefile in the top-level directory to match the compiler and build parameters required.
If you don't do this, you will get this error:
mpicc -std=gnu99 -Wall -Wempty-body -Werror -Wstrict-prototypes -Werror=maybe-uninitialized -Warray-bounds -g3 -lefence -I./include/ -I./src/ -I./build/pfind/src/ -I./build/ior/src/ -DVERSION="\"io500-sc20_v3-6-gd25ea80d54c7\"" -c -o verifier.o src/verifier.c
/bin/sh: mpicc: command not found
make: *** [Makefile:59: verifier.o] Error 127
Step 2. Run ./prepare.sh with CC= defined in the environment to match what you put in the Makefile above.
If you don't do this, you will get this error:
Building parallel find; Using LZ4 for optimization
./compile.sh: line 22: mpicc: command not found
You may also need to edit ./prepare.sh itself and edit the build_ior and related build functions.
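Both errors above come down to the build not finding the compiler. A minimal sketch of a wrapper that checks for the compiler before handing it to ./prepare.sh follows. The function name build_io500_with is hypothetical and not part of IO500, and the compiler name in the example is an assumption about your site.

```shell
# Sketch: fail early if the chosen compiler is missing, then build.
# "build_io500_with" is a hypothetical helper name, not part of IO500.
build_io500_with() {
    local compiler="$1"

    # command -v succeeds only if the compiler is resolvable in PATH.
    if ! command -v "$compiler" >/dev/null 2>&1; then
        echo "error: compiler '$compiler' not found in PATH" >&2
        return 1
    fi

    # Pass the same compiler to prepare.sh that you set in the Makefile.
    CC="$compiler" ./prepare.sh
}

# Example (assumes a compiler wrapper named "cc"; substitute your own):
#   build_io500_with cc
```

Run it from the top-level directory of the source tree, since it calls ./prepare.sh by relative path.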
The benchmark will test whatever file system owns the path given by the datadir parameter in the ini file passed to it at launch time. By default, this is
[global]
datadir = ./datafiles
which benchmarks a path relative to $PWD. Similarly, it outputs its results to whatever is given as the resultdir path. By default, this is empty in config-minimal.ini and is equivalent to
[global]
resultdir = ./results
Edit the ini file to point to whatever path(s) you want, then do something like
srun -N 4 \
     -n 64 \
     --qos regular \
     -C haswell \
     -t 30:00 \
     ./io500 config-minimal.ini
Alternatively submit it in batch mode, since it does run for a while (each test runs for five minutes).
To integrate with a batch environment and dynamic mounts (e.g., a burst buffer) you have to do a little bit of gymnastics since io500 only takes its config from a preformatted file.
This is how I run against DataWarp:
#!/usr/bin/env bash
#SBATCH -N 4
#SBATCH -n 64
#SBATCH --qos regular
#SBATCH -C haswell
#SBATCH -t 30:00
#SBATCH -A nstaff
#DW jobdw type=scratch access_mode=striped capacity=20TiB

CONFIG_FILE="$SLURM_SUBMIT_DIR/config-$SLURM_JOBID.ini"

cat <<EOF > "$CONFIG_FILE"
[global]
datadir = $DW_JOB_STRIPED
EOF

srun "$SLURM_SUBMIT_DIR/io500" "$CONFIG_FILE"
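The script above only sets datadir, so results land in the default ./results relative to wherever the job runs. If you want to pin both paths per job, a small helper along these lines might do. The function name write_io500_config is hypothetical; the datadir and resultdir keys are the ones described earlier on this page.

```shell
# Sketch: write a per-job io500 config with both paths set explicitly.
# "write_io500_config" is a hypothetical helper name, not part of IO500.
write_io500_config() {
    local config_file="$1"
    local datadir="$2"
    local resultdir="$3"

    # The heredoc body is unindented on purpose so the ini file has no
    # leading whitespace; shell variables are expanded when it is written.
    cat > "$config_file" <<EOF
[global]
datadir = $datadir
resultdir = $resultdir
EOF
}

# Example inside the batch script:
#   write_io500_config "$CONFIG_FILE" "$DW_JOB_STRIPED" "$SLURM_SUBMIT_DIR/results"
```

Keeping resultdir on a persistent file system matters with dynamic mounts, since anything left on the burst buffer allocation goes away with the job.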
Notes about the way the io500 binary works:
- It automatically scales the benchmark to match the nprocs it's given, and by default its paths are relative to whatever $PWD is. So it inherits a lot from the execution environment.
- It uses stonewalling by default, so some of the input parameters may seem ridiculously large. For example IOR easy is configured to write over 9 TB by default.
The results directory contains timestamped output directories, one per run.
This is pretty nice in that running the same io500 repeatedly does not wipe out the results of previous runs.
In each directory are two important summary files:
- result_summary.txt - an easy, human-readable file with individual performance measurements and the IO500 score
- result.txt - a machine-readable summary of results
The result.txt file is nice, but it labels everything as "score" without units. For reference, the ior tests are reported in GiB/s and the mdtest and find tests are reported in kIOPS.
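A rough sketch of putting units back onto those scores is below. It assumes that result.txt is ini-style, with one [section] per test (named like ior-easy-write or mdtest-hard-stat) and a "score = value" line inside each; that layout is my assumption, so check it against your own file first. The function name annotate_scores is hypothetical.

```shell
# Sketch: print "<test> <score> <unit>" for every test in a result.txt.
# ASSUMPTION: the file is ini-style, with [test-name] section headers and
# a "score = <number>" line in each test's section.
annotate_scores() {
    LC_ALL=C awk -F' *= *' '
        # Remember the most recent section header, minus its brackets.
        /^\[.*\]$/ { section = substr($0, 2, length($0) - 2); next }

        # ior-* tests are bandwidth (GiB/s); everything else is kIOPS.
        $1 == "score" {
            unit = (section ~ /^ior-/) ? "GiB/s" : "kIOPS"
            printf "%s %s %s\n", section, $2, unit
        }
    ' "$1"
}

# Example:
#   annotate_scores results/<timestamped-directory>/result.txt
```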
The final IO500 score is calculated by
1. Taking the geometric mean of the GiB/s scores
2. Taking the geometric mean of the kIOPS scores
3. Taking the geometric mean of #1 and #2
This is done rather than taking the geometric mean of all individual scores so that metadata (kIOPS) are weighted equally with bandwidth (GiB/s).
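The calculation described above can be sketched in a few lines of shell and awk. This is only an illustration of the arithmetic, not the code io500 itself uses; the function names geomean and io500_score are hypothetical, and all inputs must be positive numbers.

```shell
# Sketch: geometric mean of the positive numbers given as arguments.
geomean() {
    printf '%s\n' "$@" |
        LC_ALL=C awk '{ sum += log($1); n++ }
                      END { if (n) printf "%.4f\n", exp(sum / n) }'
}

# Sketch: combined score from two space-separated lists of scores,
# the first in GiB/s and the second in kIOPS.
io500_score() {
    local bw md
    # $1 and $2 are intentionally unquoted so each list splits into words.
    bw=$(geomean $1)
    md=$(geomean $2)
    geomean "$bw" "$md"
}

# Example with made-up numbers (not real results):
#   io500_score "1 4" "2 8"
```

Because the two sub-scores are averaged separately first, the four bandwidth tests together carry the same weight as the eight metadata tests together.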
Attributing any intellectual value to the final io500 score is unwise; this score metric is a figure of merit that carries a number of biases. Consider the following aspects of the IO500 combined score.
No physical meaning
Taking the geometric mean of GiB/s and kIOPS mixes units of measure in a nonsensical way. The score is expressed in units of "square root of gigabyte-kilo-operations per second" which has no meaning.
Arbitrary equivalence of GiB/s and kIOPS
The IO500 score also draws an arbitrary equivalency between the difficulty of achieving one gigabyte per second and one kilo-I/O operation per second of performance. For example, let's say your IO500 run achieves an overall bandwidth score of 1 GiB/s and overall IOPS score of 1 kIOPS when you first run it. Through heroic effort, you are then able to double the performance for all eight IOPS tests to achieve a score of 2 kIOPS, and your score would go from 1.0 to about 1.4 (the square root of 2). Because of the way the IO500 score works though, you could get that same exact score (about 1.4) by instead doubling the performance of your four bandwidth tests.
Is it a true statement that getting this additional 1 GiB/s of performance would be as easy as getting 1 kIOPS of performance improvement? Put into more real terms, is it easier to buy a file system that can deliver 1 TiB/s of bandwidth or 1 MIOPS? The former requires hundreds of SSDs and dozens of servers, but the latter can be achieved with dozens of SSDs and a single server. And the IO500 score treats them equivalently.
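To make the symmetry concrete, here is a small sketch that computes the combined score directly from an aggregate bandwidth score and an aggregate IOPS score. The function name combined_score is hypothetical, and the inputs are illustrative numbers rather than measurements.

```shell
# Sketch: combined score from an aggregate GiB/s score and an aggregate
# kIOPS score, i.e. the square root of their product.
combined_score() {
    LC_ALL=C awk -v bw="$1" -v md="$2" \
        'BEGIN { printf "%.2f\n", sqrt(bw * md) }'
}

combined_score 1 1      # baseline
combined_score 1 2      # double every IOPS test
combined_score 2 1      # double every bandwidth test; same as the line above

# 1 TiB/s is 1024 GiB/s and 1 MIOPS is 1000 kIOPS, so with the other
# dimension held at 1 the two systems score within about 1% of each other.
combined_score 1024 1
combined_score 1 1000
```

The score only sees the product of the two numbers, so it cannot tell which dimension the improvement came from or what it cost to get there.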
This becomes further complicated when you consider the client requirements to achieve 1 TiB/s versus 1 million IOPS. Adding more clients typically improves streaming bandwidth to independent files (ior-easy) but increases lock contention to shared files and directories (ior-hard and mdtest-hard). As a result, getting a high IO500 score requires finding the perfect balance of clients to servers that has enough parallelism to get decent bandwidth for ior-easy while not causing too much contention and dropping the IOPS tests' performance.
A kIOPS is weighted the same as a GiB/s, yet modern flash devices deliver hundreds of kIOPS and only single-digit GiB/s. Maximizing your IO500 score therefore usually means sacrificing your bandwidth score to pump up your IOPS scores, since they give you more mileage more easily.
Arbitrary classification of tests as bandwidth or IOPS
I am still mentally working through the ior-hard test and its choice of units. Don't quote me on any of the following.
One final aspect of the IO500 scoring scheme to consider is that it arbitrarily defines the ior-hard tests (performing 47 KB I/Os) as a bandwidth test instead of an IOPS test. At this transfer size, many file systems will be IOPS-limited, not bandwidth-limited. Since this test is scored in GiB/s but performs poorly when expressed that way, ior-hard arbitrarily penalizes file systems that treat the intended test (unaligned-but-consistent data accesses) as an IOPS-bound problem (e.g., using locks) relative to those that treat it as a bandwidth problem (e.g., using log-structured writes).
This may be valid, since dumping a fixed dataset to storage as quickly as possible (as this test does) is fundamentally a bandwidth problem. However, there are other aspects of log-structured file systems (such as asynchronous compaction activity) that IO500 does not test, so it does not penalize such file systems in the same way it penalizes locking file systems.
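To put rough numbers on why small transfers look bad when scored as bandwidth, the sketch below converts an operation rate into GiB/s for a given transfer size. The function name kiops_to_gibs is hypothetical, and the 47008-byte transfer size in the example is my assumption for what "47 KB" means here.

```shell
# Sketch: bandwidth in GiB/s delivered by a given rate of fixed-size I/Os.
#   $1 = rate in kIOPS, $2 = transfer size in bytes
kiops_to_gibs() {
    LC_ALL=C awk -v kiops="$1" -v bytes="$2" \
        'BEGIN { printf "%.2f\n", kiops * 1000 * bytes / 1073741824 }'
}

# Example (assumes 47008-byte transfers):
#   kiops_to_gibs 100 47008
```

Under that assumption, a file system sustaining 100 kIOPS on this pattern would report only a little over 4 GiB/s, which is why a locking file system can post a respectable operation rate and still get a poor ior-hard score.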
Significance of scores will change over time
Because IO500 is a multidimensional score that arbitrarily equates GiB/s to kIOPS, the meaning of this aggregate score will change over time as the relative difficulty of getting more GiB/s instead of kIOPS from different hardware and file system technologies changes.
This is notably different from something like Top500, whose score has a fixed meaning that is not dependent on the relative difficulty of achieving one dimension of performance over another. 100 FLOPS is 100 FLOPS regardless of whether it's 1990 or 2020, but the same cannot be said for 20 sqrt(GiB * kIOP)/sec. Technologies like persistent memory make it much easier to achieve huge IOPS while not really improving bandwidth at all; thus, it's far easier to post a staggeringly high IO500 score by achieving extremely high IOPS (using persistent memory) than it is to do the same by achieving extremely high bandwidth.
IO-500's workloads fail to acknowledge that there are two motivations for benchmarking: understanding system-level performance and understanding application performance. Synthetic benchmarks and workloads do fine for the former, but application-derived workloads are important for the latter.
IO-500's ior-easy and mdtest benchmarks are tests of system capability in that they demonstrate the peak capability of a system using idealized patterns that real applications strive to generate. On the other hand, the ior-hard benchmark tests an arbitrary pattern that is neither representative of system capability nor any specific user application.
At the same time, the IO-500 benchmark lacks the standard 4K random read/write patterns on the grounds that those patterns do not represent any real workload. This is a poorly formed argument, as the other tests (with the exception of the find test) do not represent particular workloads either. Because IO-500 does not focus on either testing specific applications or system-level performance exclusively, the tests (and therefore the scores) are equally unfocused and arbitrary.
Corollary on Interpreting Scores
Don't put a lot of weight on the combined IO500 score of any particular storage system, because it has a few subtle biases in it that do not apply to all computing environments or workloads. Realize that the IO500 list, when sorted by aggregate score, reflects not only hardware and software capability, but a degree of willingness to sacrifice bandwidth scores to pump up IOPS scores. It is gamed to the degree that the top tests aren't necessarily run using client counts, job geometries, and overall configurations that reflect reality.
Instead, look at the scores individually to determine which scores matter most to you and the workloads you're having to handle. The IO500 list is extremely valuable in its transparency and multidimensionality, and its maintainers have created a lot of nice tools to make it easy for you to sort by the metrics that matter to you. Make use of this, and be skeptical of anyone who boasts of their IO500 position as evidence that they have a good file system.