Introduction
Darshan is a very useful tool that intercepts I/O calls within HPC applications to perform lightweight profiling. It's designed to be enabled by default in HPC environments so that all jobs produce Darshan I/O profiles, but as a result, it can be less straightforward to use if you just want to profile your own application without having an administrator install it site-wide.
This page contains some notes I've taken over the years of using it both for my own personal purposes and as a part of a site-wide deployment.
Building for Individual Use
Out of the box, Darshan is designed to be built for system-wide deployment, and its official build documentation presumes as much. This is not great for debugging or testing new features or versions yourself, so I use the following recipe to build Darshan for personal use:
./configure --with-log-path-by-env=DARSHAN_OUTPUT_DIR,SLURM_SUBMIT_DIR,PWD \
--with-jobid-env=SLURM_JOBID \
--disable-cuserid \
--with-mem-align=8 \
--enable-mmap-logs \
--prefix=$HOME/apps.cori-knl/darshan-3.1.3 \
CC=mpicc
where
- --with-log-path-by-env provides a comma-separated list of environment variables that Darshan scans at runtime to determine where it should drop its profiling log file. If you provide at least SLURM_SUBMIT_DIR,PWD, Darshan will always write its output somewhere reasonable even when an explicit logfile location isn't given; this option is required because Darshan was originally designed to retain all log files in a single system-wide repository. If you are using something other than Slurm, use your resource manager's equivalent (e.g., PBS_JOBDIR).
- --with-jobid-env is required but can be any arbitrary environment variable as long as it is defined at runtime. It is used to uniquely identify the log file names generated by Darshan, and is often something like SLURM_JOBID (for Slurm) or PBS_JOBID (for PBSPro). In hand-spun environments, I've also used RANDOM.
- --disable-cuserid makes Darshan not try to resolve uids to usernames. Apparently this is a requirement on Cray XC systems; you may not need it for regular clusters.
- --with-mem-align=8 is required.
- --enable-mmap-logs enables a mode where temporary log files are created in /tmp while the job is running. If the job crashes (or otherwise never calls MPI_Finalize()), these temporary logs can be collected post-hoc and reassembled.
- --prefix is whatever you want it to be.
- CC=mpicc should be CC=cc on Cray systems and CC=mpicc elsewhere.
On heterogeneous Cray systems, you can cross-compile (e.g., for KNL) if you want by adding this to the configure line:
--host=x86_64-knl-linux
This flag is picked up by autoconf and activates cross-compilation; the -knl- part is an arbitrary string you can set.
That said, Darshan compiled for Haswell can be linked against an otherwise-KNL application binary and it'll work fine, so cross-compiling Darshan isn't necessary.
Relatedly, you should always compile Darshan with gcc even if you compile your MPI applications with other compilers (PGI, Intel, etc). All modern compilers will happily link against Darshan built using GCC, but the reverse is not true.
You do have to recompile a different version of Darshan for each MPI library you wish to use though. On a system that provides mpich, mvapich, and OpenMPI, you'd have a Darshan build for (gcc + mpich), (gcc + mvapich), and (gcc + OpenMPI).
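On non-Cray systems, the simplest way I've found to use a personal Darshan build is to preload its runtime library when launching a dynamically linked application. A minimal sketch, assuming the install prefix from the configure line above and a Slurm launcher:
# preload the Darshan runtime so it can intercept the application's I/O calls
export LD_PRELOAD=$HOME/apps.cori-knl/darshan-3.1.3/lib/libdarshan.so
# optional: DARSHAN_OUTPUT_DIR is the first variable listed in --with-log-path-by-env,
# so Darshan writes its log here; otherwise it falls back to SLURM_SUBMIT_DIR or PWD
export DARSHAN_OUTPUT_DIR=$PWD
srun -N 4 -n 32 ./my_mpi_app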
On Cray systems, integrating Darshan is very easy because the installation process generates modulefiles that work with the Cray build environment. They can be found in the share/ subdirectory of the installation path and activated by issuing, e.g.,
module use $HOME/apps.cori-knl/darshan-3.1.3/share/craype-2.x/modulefiles
module load darshan
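Once the module is loaded, the Cray compiler wrappers link Darshan into applications automatically. A quick sanity check I like to do (a rough sketch; the actual log file name will include your username, executable name, and jobid, and assumes darshan-parser from the matching darshan-util install is in your PATH) is to rebuild a small application, run it, and parse the resulting log:
cc -o my_mpi_app my_mpi_app.c    # Cray compiler wrapper links in the Darshan library
srun -n 32 ./my_mpi_app
darshan-parser --base *my_mpi_app*.darshan | head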
Building for System-wide Deployment
Building Darshan for system-wide deployment is pretty straightforward. After setting up the global log repository, I've used the following configure line to build Darshan for site-wide use at NERSC:
./configure --with-mem-align=8 \
--with-log-path=/global/cscratch1/sd/darshanlogs \
--prefix=$INSTALL_DIR/3.1.7 \
--enable-mmap-logs \
--with-jobid-env=SLURM_JOB_ID \
--disable-cuserid \
--enable-group-readable-logs \
CC=cc
The big difference is that --with-log-path is used instead of --with-log-path-by-env so that the output directory for the Darshan logs is system-defined, not user-defined.
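Setting up that global log repository mostly means creating a dated (year/month/day) directory hierarchy under the path given to --with-log-path, and darshan-runtime ships a helper script for this. A minimal sketch, assuming the install prefix above and a user with write access to the repository:
# creates the dated directory tree under the configured --with-log-path
# (darshan-mk-log-dirs.pl is installed as part of darshan-runtime)
$INSTALL_DIR/3.1.7/bin/darshan-mk-log-dirs.pl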
As with the instructions above, integrating Darshan into the Cray environment is trivial if you use the modulefiles that are built along with Darshan.
Regression Testing
There is a small regression test suite included with Darshan.
From Shane Snyder, a maintainer of Darshan:
The regression tests just compile and execute 8 simple test programs, then check their darshan logs to make sure counters are as expected. This can help detect bugs in Darshan's compilation wrappers, its runtime code, or its log parsing code. There are tests for C, C++, and Fortran to help detect any compiler-specific issues.
To invoke the tests, there is a script called run-all.sh in the darshan_src_dir/darshan-test/regression directory. You invoke it as follows:
./run-all.sh <darshan_install_directory> <test_output_directory> <test_platform_dir>
where
- test_output_directory is just where all of the scheduler files, darshan logs, and benchmark output go.
- test_platform_dir is the directory of system-specific scripts that do the regression testing. They are located in the darshan_src_dir/darshan-test/regression directory; the one you would use at NERSC is the cray-module-nersc directory.
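For example, a concrete invocation on a NERSC-like system might look like the following (the install and scratch paths here are just illustrative):
cd darshan_src_dir/darshan-test/regression
./run-all.sh $HOME/apps.cori-knl/darshan-3.1.3 $SCRATCH/darshan-regression-test cray-module-nersc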
It's generally good practice to run these after each Darshan install.
Partial Log Support
A major problem with Darshan in production is that it will not produce log files unless an application calls MPI_Finalize(), meaning that jobs that hit their walltime will never drop a log even though they were profiled. To address this, Darshan 3.1.1 included the --enable-mmap-logs compile option, which gets Darshan to use a memory-mapped buffer in tmpfs to persist a partial log even if the application crashes. This feature has no measurable performance impact (see Snyder et al., 2016) and is generally a good thing to enable as long as you clean out your tmpfs at the end of each job.
Collecting Partial Logs
The partial logs generated by an aborted job can be collected using something like
srun -N 16 -n 16 --ntasks-per-node=1 bash -c 'mv /tmp/${USER}_*.darshan $SLURM_SUBMIT_DIR || /bin/true'
It's easiest to incorporate this functionality into an epilog script. For example, in a Slurm environment you might have a task epilog script, $PWD/epilog.sh, containing:
#!/bin/bash
SENTINEL_FILE="/tmp/epilog_${SLURM_JOB_ID}"
if [ ! -f "$SENTINEL_FILE" ]; then
    touch "$SENTINEL_FILE"
    mv /tmp/*id${SLURM_JOB_ID}*.darshan $SLURM_SUBMIT_DIR || /bin/true
fi
and passed to Slurm via
srun --task-epilog=$PWD/epilog.sh -N 4 -n 32 ./my_mpi_app arg1 arg2 ...
It is important to note the following about such user epilog scripts with Slurm:
- the script is run after each srun, not at the end of the entire sbatch script
- the script is run one time for each process launched, so it will run many times on each node (make sure the contents of the script are race-proof!)
- the script MUST be readable and executable by root, so if it lives on a root-squashed file system, it must be world-readable and world-executable
Demultiplexing Logs
Darshan partial logs will have the form
glock_ior_id3015917_mmap-log-13409646026002993551-21.darshan
where
- 3015917 is the Slurm jobid
- 13409646026002993551 is a random number that is unique to the srun/mpirun to which the process that generated this log belonged. This allows a single jobid that has multiple srun/mpiruns to retain unique logfile names.
- 21 is the MPI rank that generated the log file
These three values can be used to demultiplex giant directories full of partial logs.
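For example, here is a rough sketch of one way to do that demultiplexing with a shell loop, grouping partial logs by the per-invocation random number so that each group can be handed to darshan-merge separately (the sed expressions assume the file name pattern shown above):
# group partial logs into one directory per srun/mpirun invocation
for id in $(ls *_mmap-log-*.darshan | sed -e 's/.*mmap-log-//' -e 's/-[0-9]*\.darshan$//' | sort -u); do
    mkdir -p "bundle-$id"
    mv *mmap-log-"$id"-*.darshan "bundle-$id/"
done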
Reconstructing Logs (darshan-merge)
The darshan-merge utility has pretty simple syntax:
darshan-merge --output glock_ior_id3015917-13409646026002993551.darshan \
--shared-redux \
glock_ior_id3015917_mmap-log-13409646026002993551-*.darshan
where
- --output glock_ior_id3015917-13409646026002993551.darshan designates the file name for the reconstructed log we are creating.
- --shared-redux reduces common file records just like Darshan does by default.
- glock_ior_id3015917_mmap-log-13409646026002993551-*.darshan is the glob that hits only those logs relevant to a single mpiexec/srun invocation (see Demultiplexing Logs above).
Q: What happens when darshan-merge is not given all of the logs for a job?
A: It still works and generates a darshan log, but the missing counters are simply not represented. It would be helpful if darshan-merge could verify that the number of input files matches the nprocs field in the header; right now there's no simple way to verify that all of the logs were present.
Q: What happens when darshan-merge is (accidentally) given multiple jobs' logs?
A: It still works and generates a darshan log with all of the records combined for each module. The header appears to reflect the first log opened. In short, the resulting log is technically valid but logically nonsensical.
Q: Does darshan-merge rely on metadata encoded in file names?
A: No. You can call your partial logs whatever you want, and darshan-merge will happily merge them.
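Regarding the first question above: until darshan-merge grows such a check, a hedged workaround is to count the partial logs yourself and compare that against the nprocs value that darshan-parser reports from the merged log's header (the exact header label is an assumption based on typical darshan-parser output):
# number of per-rank partial logs that went into the merge
ls glock_ior_id3015917_mmap-log-13409646026002993551-*.darshan | wc -l
# nprocs recorded in the merged log's header
darshan-parser glock_ior_id3015917-13409646026002993551.darshan | grep -i nprocs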
Anonymizing Darshan Logs
Researchers often want access to large collections of Darshan logs to synthesize aggregate workloads. Because Darshan logs user names, command-line arguments, paths, and file names, it's not a good idea to share users' logs without first anonymizing them. Fortunately Darshan includes the tools to do just that.
Note that not all data is anonymized; for example, file system mount points are retained so that file records can be tied back to an underlying file system. However, some file systems (e.g., DataWarp) encode identifiable information (e.g., the job id) in their mount location.
The process to anonymize large quantities of logs is as follows.
Step 1. Make the directory structure for the outputs (the darshan script does not do this):
for i in $(seq 1 12); do
mkdir -p 2016/$i
done
Step 2. Then for each month, mkdir the days, e.g.,
for i in $(seq 1 31); do
mkdir 2016/7/$i
done
Step 3. Then create a manifest of logs to convert, e.g.,
lfs find ./2016 -name \*.darshan > manifest.txt
This can probably be implemented within the darshan-convert-logs.pl script, and I'm sure the Darshan developers would love it if an enterprising contributor submitted a pull request that implemented this.
Step 4. Make sure the log conversion script and its required helper binaries are in place. This means copying (or linking) a few scripts into $PWD:
ln -s ~/src/git/darshan/darshan-util/jenkins-hash-gen
ln -s ~/src/git/darshan/darshan-util/darshan-convert-logs.pl
ln -s ~/src/git/darshan/darshan-util/darshan-convert
Step 5. Then launch the log conversion script:
./darshan-convert-logs.pl 0 /dev/null ./manifest.txt ./
where
- 0 is the randomization seed used to hash personally identifiable information.
- /dev/null is a file containing annotations. I've never used this feature of the anonymizer, so I just pass it an empty file to skip the annotation part.
- manifest.txt is the manifest you generated using the find command above.
- ./ is the root of the tree where anonymized logs will be produced.
Make sure your darshan-convert tool matches the version of Darshan with which you will want to parse these obfuscated logs. It is a good idea to edit the darshan-convert-logs.pl script and ensure that the paths to the binaries it contains are the ones you intend to use.
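As a final spot check (a hedged suggestion, using a hypothetical file name), I like to confirm that an anonymized log still parses cleanly and that its header no longer shows a real username or executable path:
darshan-parser 2016/7/1/anonymized_example.darshan | head -n 25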