md-workbench generates a semi-synchronous metadata-intensive workload that was designed to mimic what a parallel compilation may look like to a file system. It runs in three phases which are described below.

Let's take a look at what a simple md-workbench run does:

mpirun -n 2 md-workbench -I 7 -P 11 -D 3

According to the manual,

  • -I is the number of "objects" (files) to manipulate per "data set" (directory)
  • -P is the number of objects to precreate per data set
  • -D is the number of data sets to manipulate per process

Phase 1 - Precreate phase

This phase can be isolated by specifying the -1 or --run-precreate option.
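For example, to run just this phase of the example above, something like this should work:

mpirun -n 2 md-workbench -1 -I 7 -P 11 -D 3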

Rank 0 does

  • mkdir out/0_0
  • mkdir out/0_1
  • mkdir out/0_2

and rank 1 does

  • mkdir out/1_0
  • mkdir out/1_1
  • mkdir out/1_2

Then rank 0 creates a bunch of files:

  • open(out/0_0/file-0, O_CREAT)
  • write 3901 bytes to this file
  • close this file
  • ...
  • open(out/0_0/file-10, O_CREAT)
  • write 3901 bytes to this file
  • close this file

then repeat this for out/0_1 and out/0_2.

Then there's a barrier.

As you can see, we created three directories per rank because of -D 3 and eleven files per directory because of -P 11. The -I 7 does not play a role here.
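In total, that works out to 2 ranks × 3 data sets × 11 objects = 66 precreated files of 3,901 bytes each, or a little over 250 KiB on disk.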

The 3901-byte file size can be changed using -S or --object-size, and this phase can be run multiple times using -R or --iterations.
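For example, to precreate 1 MiB objects and run only a single iteration, something like this should work (assuming -S takes a size in bytes, which is consistent with the 3901-byte default and the -S 0 behavior described later):

mpirun -n 2 md-workbench -1 -I 7 -P 11 -D 3 -S 1048576 -R 1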

Phase 2 - Benchmark phase

This phase can be isolated by specifying the -2 or --run-benchmark option.

Rank 0 does:

  • stat(out/1_0/file-0)
  • open(out/1_0/file-0)
  • read 3901 bytes from $WORKDIR/out/1_0/file-0 - this is an absolute path, not a relative one
  • close($WORKDIR/out/1_0/file-0)
  • unlink(./out/1_0/file-0)

  • open(./out/1_0/file-10, O_CREAT)

  • write 3901 bytes to $WORKDIR/out/1_0/file-10
  • close($WORKDIR/out/1_0/file-10)

This pattern is then repeated with different files and directories for a total of 21 cycles, or -I times -D. That is to say, for each directory:

  1. a file is statted, opened, read, closed, and deleted
  2. a new file is created, written, and closed

There is no net creation or destruction of files, but files are being created and destroyed repeatedly. The mapping of MPI ranks to directories/files here is shuffled relative to Phase 1. Stay tuned for more information on how this shuffling is determined.

Then there's a barrier.

Note that -P plays no role here; it is solely for Phase 1. However, the first round of open-read-close-unlink in Phase 2 depends on precreated files which are generated by Phase 1, so you should make sure that -P is the same as -I. If you don't do this, Phase 2 will try to open files that weren't precreated and will categorize these operations as errors on the first round. I've had this cause both harmless warnings and a full job failure, and I'm not sure what circumstances lead to which. To be safe, just always specify both -P and -I for all phases.

Also, the default number of iterations is 3 (-R 3), which means this test will run three times before completing. It's a good idea to specify -R 1 if you want the test to complete quickly.
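Putting that advice together, a benchmark-only run of the earlier example might look something like this, assuming the precreate phase was run with the same -I, -P, and -D values:

mpirun -n 2 md-workbench -2 -I 7 -P 7 -D 3 -R 1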

Phase 3 - Cleanup phase

This phase can be isolated by specifying the -3 or --run-cleanup option.

Rank 0:

  • unlinks the seven files in ./out/0_0/
  • rmdir(./out/0_0/)

Repeat for 0_1/ and 0_2/.

Rank 1 does the same for its directories and files from Phase 1.
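
As with the other phases, a cleanup-only run can be launched with something like the following, assuming the same parameters were used for the phases that created the files:

mpirun -n 2 md-workbench -3 -I 7 -P 7 -D 3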

Understanding Output

The default output of md-workbench is not labeled very well. It looks something like this (note that I've added some line breaks for clarity):

benchmark process
max:60.73s min:60.08s mean: 60.45s balance:98.9 stddev:0.2
rate:2420.7 iops/s objects:36750 rate:605.2 obj/s tp:4.5 MiB/s op-max:4.7359e-01s (0 errs)
stonewall-iter:32
read(6.6781e-04s, 1.2720e-03s, 2.1476e-03s, 4.7331e-03s, 9.7501e-03s, 2.6300e-02s, 9.0354e-02s)
stat(5.8293e-04s, 1.3051e-03s, 2.2000e-03s, 4.6260e-03s, 8.8098e-03s, 2.3064e-02s, 9.9011e-02s)
create(3.1090e-03s, 2.6335e-02s, 4.2686e-02s, 7.1706e-02s, 1.0507e-01s, 1.8351e-01s, 4.7359e-01s)
delete(1.4300e-03s, 9.6769e-03s, 1.4584e-02s, 2.1045e-02s, 3.1489e-02s, 7.1808e-02s, 3.3352e-01s)

Let's break this down. First are the basic statistics:

  • max and min reflect the wall seconds used by the slowest and fastest MPI ranks, and mean is the average across all ranks
  • balance is the fastest rank's time divided by the slowest rank's time (in percent)
  • stddev is the standard deviation of all ranks' times

Then the benchmark rate summaries:

  • rate (for iops/s) is the number of open/read/close/unlink/create+open/write/close cycles successfully completed divided by walltime, multiplied by four because md-workbench considers one cycle to be four I/O operations (write, stat, read, delete). I don't agree with that accounting, but you can always divide this rate by four to get the cycle rate and then multiply it by whatever number of I/O operations per cycle you care to use.
  • objects are the number of objects (files) successfully manipulated
  • rate (for obj/s) is objects divided by walltime - this is the same as the cycle rate and should be exactly 0.25 times the iops/s rate discussed above.
  • tp is the number of bytes successfully read and written over the whole benchmark phase divided by overall walltime. It is literally (objects created + objects read) multiplied by the object size in MiB and divided by walltime; see the worked check after this list.
  • op-max is the time taken by the slowest single operation (stat, create, read, close, etc) by any MPI rank. Not a terribly useful metric, but it tells you if a single operation on a single MPI rank dominated the overall walltime.
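
As a quick sanity check against the sample output above: 2420.7 iops/s divided by 4 is about 605.2, which matches the obj/s rate; 36750 objects divided by roughly 60.7 seconds of walltime is also about 605 obj/s; and since each cycle reads one object and writes one, 2 × 605.2 obj/s × 3901 bytes ≈ 4.5 MiB/s, which matches tp.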

The stonewall-iter is how many cycles successfully complete. This value will never exceed whatever you specified for -I.

Finally, statistics are shown for each of the I/O operations per cycle (read/stat/create/delete). They take the form

opname(min, q1, median, q3, q90, q99, max)

which is pretty self-explanatory:

  • opname denotes the timing for read/stat/create/delete
  • min - fastest time to complete the I/O operation
  • q1 - time taken to complete the op corresponding to first quartile
  • median - the median operation time
  • q3 - time taken to complete the op corresponding to third quartile
  • q90 - time taken to complete the op corresponding to the 90th percentile - these are going to be pretty slow
  • q99 - time taken to complete the op corresponding to the 99th percentile - these are the long-tail stragglers
  • max - slowest time to complete the I/O operation
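
For example, the create line in the sample output above reads: the fastest create took about 3.1 ms, the first quartile was about 26 ms, the median about 43 ms, the third quartile about 72 ms, the 90th percentile about 105 ms, the 99th percentile about 184 ms, and the slowest create took about 474 ms. That slowest create (4.7359e-01s) is also what shows up as op-max in the summary line.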

If you specify --print-detailed-stats, you get a nice columnar summary of the benchmark phases' performance:

phase       d name  create  delete  ob nam  create  read    stat    delete  t_inc_b t_no_bar    thp max_t
benchmark   0   0   20075   20075   20075   20075   62.191s 62.191s 2.40 MiB/s 1.2642e+00

but strangely, it silences the other statistics for each operation.

Stonewalling

md-workbench supports stonewalling via the -w option, but if you run your phases as separate jobs using -1 and -2 explicitly, Phase 2 will only run with stonewalling if wear-out (-W) is also specified. I think this is because md-workbench cannot store and recall the progress of each MPI rank after Phase 1, so Phase 2 does not know how many files each rank should expect to touch.
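
A benchmark-phase run with stonewalling might therefore look something like this (a sketch, assuming -w takes the stonewall time limit in seconds):

mpirun -n 2 md-workbench -2 -I 7 -P 7 -D 3 -R 1 -w 10 -W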

Omitting I/O

You can make md-workbench only test metadata operations by specifying -S 0. This sets the file size to 0 bytes, and md-workbench is smart enough to simply never call read(2) or write(2) during the precreate and benchmark phases. I argue that this is not a realistic test from a user's perspective, since there are few reasons to open a file without performing some I/O on it, but it is a good way to drive load on only the metadata subsystem of file systems that separate data from metadata (like Lustre).
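
A metadata-only run could therefore look something like:

mpirun -n 2 md-workbench -I 7 -P 7 -D 3 -S 0 -R 1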