Introduction
Darshan is a very useful tool that intercepts I/O calls within HPC applications to perform lightweight profiling. It's designed to be enabled by default in HPC environments so that all jobs produce Darshan I/O profiles, but as a result, it can be less straightforward to use if you just want to profile your own application without having an administrator install it site-wide.
This page contains some notes I've taken over the years of using it both for my own personal purposes and as a part of a site-wide deployment.
Building for Individual Use
Out of the box, Darshan is designed to be built for system-wide deployment, and its official build documentation presumes as much. This is not great for debugging or testing new features or versions yourself, so I use the following recipe to build Darshan for personal use:
./configure --with-log-path-by-env=DARSHAN_OUTPUT_DIR,SLURM_SUBMIT_DIR,PWD \
--with-jobid-env=SLURM_JOBID \
--disable-cuserid \
--with-mem-align=8 \
--enable-mmap-logs \
--prefix=$HOME/apps.cori-knl/darshan-3.1.3 \
CC=mpicc
where
- --with-log-path-by-env provides a comma-separated list of environment variables that Darshan scans at runtime to determine where it should drop its profiling log file. If you provide at least SLURM_SUBMIT_DIR,PWD, Darshan will always write its output somewhere reasonable even when an explicit logfile location isn't given; this option is required because Darshan was originally designed to retain all log files in a single system-wide repository. If you are using something other than Slurm, use your resource manager's equivalent (e.g., PBS_JOBDIR).
- --with-jobid-env is required but can be any arbitrary environment variable as long as it is defined at runtime. It is used to uniquely identify the log file names generated by Darshan, and is often something like SLURM_JOBID (for Slurm) or PBS_JOBID (for PBSPro). In hand-spun environments, I've also used RANDOM.
- --disable-cuserid makes Darshan not try to resolve uids to usernames. Apparently this is a requirement on Cray XC systems; you may not need it for regular clusters.
- --with-mem-align=8 is required.
- --enable-mmap-logs enables a mode where temporary log files are created in /tmp while the job is running. If the job crashes (or otherwise never calls MPI_Finalize()), these temporary logs can be collected post-hoc and reassembled.
- --prefix is whatever you want it to be.
- CC=mpicc should be CC=cc on Cray systems and CC=mpicc elsewhere.
On heterogeneous Cray systems, you can cross-compile (e.g., for KNL) if you want by adding this to the configure line:
--host=x86_64-knl-linux
This flag is picked up by autoconf and activates cross-compilation; the -knl- part is an arbitrary string you can set.
That said, Darshan compiled for Haswell can be linked against an otherwise-KNL application binary and it'll work fine, so cross-compiling Darshan isn't necessary.
Relatedly, you should always compile Darshan with gcc even if you compile your MPI applications with other compilers (PGI, Intel, etc). All modern compilers will happily link against Darshan built using GCC, but the reverse is not true.
You do have to recompile a different version of Darshan for each MPI library you wish to use though. On a system that provides mpich, mvapich, and OpenMPI, you'd have a Darshan build for (gcc + mpich), (gcc + mvapich), and (gcc + OpenMPI).
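On non-Cray systems, the simplest way I've found to use a personal Darshan build is to preload its runtime library when launching a dynamically linked application. A minimal sketch, assuming the install prefix from the configure line above and a Slurm launcher:
# preload the Darshan runtime so it can intercept the application's I/O calls
export LD_PRELOAD=$HOME/apps.cori-knl/darshan-3.1.3/lib/libdarshan.so
# optional: DARSHAN_OUTPUT_DIR is the first variable listed in --with-log-path-by-env,
# so Darshan writes its log here; otherwise it falls back to SLURM_SUBMIT_DIR or PWD
export DARSHAN_OUTPUT_DIR=$PWD
srun -N 4 -n 32 ./my_mpi_app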
On Cray systems, integrating Darshan is very easy because the installation process generates modulefiles that work with the Cray build environment. They can be found in the share/ subdirectory of the installation path and activated by issuing, e.g.,
module use $HOME/apps.cori-knl/darshan-3.1.3/share/craype-2.x/modulefiles
module load darshan
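Once the module is loaded, the Cray compiler wrappers link Darshan into applications automatically. A quick sanity check I like to do (a rough sketch; the actual log file name will include your username, executable name, and jobid, and assumes darshan-parser from the matching darshan-util install is in your PATH) is to rebuild a small application, run it, and parse the resulting log:
cc -o my_mpi_app my_mpi_app.c    # Cray compiler wrapper links in the Darshan library
srun -n 32 ./my_mpi_app
darshan-parser --base *my_mpi_app*.darshan | head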
Building for System-wide Deployment
Building Darshan for system-wide deployment is pretty straightforward. After setting up the global log repository, I've used the following configure line to build Darshan for site-wide use at NERSC:
./configure --with-mem-align=8 \
--with-log-path=/global/cscratch1/sd/darshanlogs \
--prefix=$INSTALL_DIR/3.1.7 \
--enable-mmap-logs \
--with-jobid-env=SLURM_JOB_ID \
--disable-cuserid \
--enable-group-readable-logs \
CC=cc
The big difference is that --with-log-path is used instead of --with-log-path-by-env so that the output directory for the Darshan logs is system-defined, not user-defined.
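Setting up that global log repository mostly means creating a dated (year/month/day) directory hierarchy under the path given to --with-log-path, and darshan-runtime ships a helper script for this. A minimal sketch, assuming the install prefix above and a user with write access to the repository:
# creates the dated directory tree under the configured --with-log-path
# (darshan-mk-log-dirs.pl is installed as part of darshan-runtime)
$INSTALL_DIR/3.1.7/bin/darshan-mk-log-dirs.pl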
As with the instructions above, integrating Darshan into the Cray environment is trivial if you use the modulefiles that are built along with Darshan.
Regression Testing
There is a small regression test suite included with Darshan.
From Shane Snyder, a maintainer of Darshan:
The regression tests just compile and execute 8 simple test programs, then check their darshan logs to make sure counters are as expected. This can help detect bugs in Darshan's compilation wrappers, its runtime code, or its log parsing code. There are tests for C, C++, and Fortran to help detect any compiler-specific issues.
To invoke the tests, there is a script called run-all.sh in the darshan_src_dir/darshan-test/regression directory. You invoke it as follows:
./run-all.sh <darshan_install_directory> <test_output_directory> <test_platform_dir>
where
- test_output_directory is just where all of the scheduler files, darshan logs, and benchmark output go.
- test_platform_dir is the directory of system-specific scripts that do the regression testing. They are located in the darshan_src_dir/darshan-test/regression directory; the one you would use at NERSC is the cray-module-nersc directory.
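For example, a concrete invocation on a NERSC-like system might look like the following (the install and scratch paths here are just illustrative):
cd darshan_src_dir/darshan-test/regression
./run-all.sh $HOME/apps.cori-knl/darshan-3.1.3 $SCRATCH/darshan-regression-test cray-module-nersc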
It's generally good practice to run these after each Darshan install.
Partial Log Support
A major problem with Darshan in production is that it will not produce log files unless an application calls MPI_Finalize(), meaning that jobs that hit their walltime will never drop a log even though they were profiled. To address this, Darshan 3.1.1 included the --enable-mmap-logs compile option, which gets Darshan to use a memory-mapped buffer in tmpfs to persist a partial log even if the application crashes. This feature has no measurable performance impact (see Snyder et al., 2016) and is generally a good thing to enable as long as you clean out your tmpfs at the end of each job.
Collecting Partial Logs
The partial logs generated by an aborted job can be collected using something like
srun -N 16 -n 16 --ntasks-per-node=1 bash -c 'mv /tmp/${USER}_*.darshan $SLURM_SUBMIT_DIR || /bin/true'
It's easiest to incorporate this functionality into an epilog script. For example, in a Slurm environment you might have a task epilog script, $PWD/epilog.sh, containing:
#!/bin/bash
SENTINEL_FILE="/tmp/epilog_${SLURM_JOB_ID}"
if [ ! -f "$SENTINEL_FILE" ]; then
    touch "$SENTINEL_FILE"
    mv /tmp/*id${SLURM_JOB_ID}*.darshan $SLURM_SUBMIT_DIR || /bin/true
fi
and passed to Slurm via
srun --task-epilog=$PWD/epilog.sh -N 4 -n 32 ./my_mpi_app arg1 arg2 ...
It is important to note the following about such user epilog scripts with Slurm:
- the script is run after each srun, not at the end of the entire sbatch script
- the script is run one time for each process launched, so it will run many times on each node (make sure the contents of the script are race-proof!)
- the script MUST be readable and executable by root, so if it lives on a root-squashed file system, it must be world-readable and world-executable
Demultiplexing Logs
Darshan partial logs will have the form
glock_ior_id3015917_mmap-log-13409646026002993551-21.darshan
where
- 3015917 is the Slurm jobid
- 13409646026002993551 is a random number that is unique to the srun/mpirun to which the process that generated this log belonged. This allows a single jobid that has multiple srun/mpiruns to retain unique logfile names.
- 21 is the MPI rank that generated the log file
These three values can be used to demultiplex giant directories full of partial logs.
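For example, here is a rough sketch of one way to do that demultiplexing with a shell loop, grouping partial logs by the per-invocation random number so that each group can be handed to darshan-merge separately (the sed expressions assume the file name pattern shown above):
# group partial logs into one directory per srun/mpirun invocation
for id in $(ls *_mmap-log-*.darshan | sed -e 's/.*mmap-log-//' -e 's/-[0-9]*\.darshan$//' | sort -u); do
    mkdir -p "bundle-$id"
    mv *mmap-log-"$id"-*.darshan "bundle-$id/"
done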
Reconstructing Logs (darshan-merge)
The darshan-merge utility has pretty simple syntax:
darshan-merge --output glock_ior_id3015917-13409646026002993551.darshan \
--shared-redux \
glock_ior_id3015917_mmap-log-13409646026002993551-*.darshan
where
- --output glock_ior_id3015917-13409646026002993551.darshan designates the file name for the reconstructed log we are creating.
- --shared-redux reduces common file records just like Darshan does by default.
- glock_ior_id3015917_mmap-log-13409646026002993551-*.darshan is the glob that hits only those logs relevant to a single mpiexec/srun invocation (see Demultiplexing Logs above).
Q: What happens when darshan-merge is not given all of the logs for a job?
A: It still works and generates a darshan log, but the missing counters are simply not represented. It would be helpful if darshan-merge could verify that the number of input files matches the nprocs field in the header; right now there's no simple way to verify that all of the logs were present.
Q: What happens when darshan-merge is (accidentally) given multiple jobs' logs?
A: It still works and generates a darshan log with all of the records combined for each module. The header appears to reflect the first log opened. In short, the resulting log is technically valid but logically nonsensical.
Q: Does darshan-merge rely on metadata encoded in file names?
A: No. You can call your partial logs whatever you want, and darshan-merge will happily merge them.
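Regarding the first question above: until darshan-merge grows such a check, a hedged workaround is to count the partial logs yourself and compare that against the nprocs value that darshan-parser reports from the merged log's header (the exact header label is an assumption based on typical darshan-parser output):
# number of per-rank partial logs that went into the merge
ls glock_ior_id3015917_mmap-log-13409646026002993551-*.darshan | wc -l
# nprocs recorded in the merged log's header
darshan-parser glock_ior_id3015917-13409646026002993551.darshan | grep -i nprocs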
Anonymizing Darshan Logs
Researchers often want access to large collections of Darshan logs to synthesize aggregate workloads. Because Darshan logs user names, command-line arguments, paths, and file names, it's not a good idea to share users' logs without first anonymizing them. Fortunately Darshan includes the tools to do just that.
Note that not all data is anonymized; for example, file system mount points are retained so that file records can be tied back to an underlying file system. However, some file systems (e.g., DataWarp) encode identifiable information (e.g., the job id) in their mount location.
The process to anonymize large quantities of logs is as follows.
Step 1. Make the directory structure for the outputs (the darshan script does not do this):
for i in $(seq 1 12); do
mkdir -p 2016/$i
done
Step 2. Then for each month, mkdir the days, e.g.,
for i in $(seq 1 31); do
mkdir 2016/7/$i
done
Step 3. Then create a manifest of logs to convert, e.g.,
lfs find ./2016 -name \*.darshan > manifest.txt
This can probably be implemented within the darshan-convert-logs.pl script, and I'm sure the Darshan developers would love it if an enterprising contributor submitted a pull request that implemented this.
Step 4. Make sure the log conversion script and its required helper binaries are in place. This means copying (or linking) a few scripts into $PWD:
ln -s ~/src/git/darshan/darshan-util/jenkins-hash-gen
ln -s ~/src/git/darshan/darshan-util/darshan-convert-logs.pl
ln -s ~/src/git/darshan/darshan-util/darshan-convert
Step 5. Then launch the log conversion script:
./darshan-convert-logs.pl 0 /dev/null ./manifest.txt ./
where
- 0 is the randomization seed used to hash personally identifiable information.
- /dev/null is a file containing annotations. I've never used this feature of the anonymizer, so I just pass it an empty file to skip the annotation part.
- manifest.txt is the manifest you generated using the find command above.
- ./ is the root of the tree where anonymized logs will be produced.
Make sure your darshan-convert tool matches the version of Darshan with which you will want to parse these obfuscated logs. It is a good idea to edit the darshan-convert-logs.pl script and ensure that the paths to the binaries it contains are the ones you intend to use.
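As a final spot check (a hedged suggestion, using a hypothetical file name), I like to confirm that an anonymized log still parses cleanly and that its header no longer shows a real username or executable path:
darshan-parser 2016/7/1/anonymized_example.darshan | head -n 25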