There are two ways to approach benchmarking.

System-level capability: This tells you the best- and worst-case performance you can get out of the system, regardless of what workload (or mix of workloads) you throw at it. It won’t capture the lived experience of day-to-day operations, but it will define what “good” and “bad” are for the system itself.

Application-level performance: This tells you what an end-user can expect from the system when running a specific, important workload. For capability systems that run one big job at a time or a low-diversity job mix, these are probably more useful. For mixed environments, though, you have to pick a few workloads with different motifs (I/O patterns, compute kernels, network traffic patterns) to get a feel for how much workload-specific variation to expect.

Storage

See Understanding and Measuring I/O Performance, a tutorial I gave at LUG22.

System-level capability is measured using tools like IOR, mdtest, and elbencho.
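
To make this concrete, here is a minimal sketch of the kind of measurement a tool like IOR automates: every process streams its own large file to the file system, and the aggregate write bandwidth is taken from the slowest writer. The scratch path, transfer size, and data volume below are made-up placeholders, and a real test would also measure reads, verify data, and sweep these parameters; this is only meant to illustrate the pattern, not replace IOR.

    # Toy file-per-process write test in the spirit of IOR's POSIX file-per-process mode.
    # TEST_DIR, XFER_SIZE, and BLOCK_SIZE are arbitrary placeholders for illustration.
    import os
    from mpi4py import MPI

    TEST_DIR = "/path/to/scratch"    # hypothetical scratch directory
    XFER_SIZE = 2 * 1024**2          # 2 MiB per write() call
    BLOCK_SIZE = 4 * 1024**3         # 4 GiB written per process

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    buf = os.urandom(XFER_SIZE)      # incompressible data

    path = os.path.join(TEST_DIR, f"testfile.{rank:05d}")
    comm.Barrier()
    t0 = MPI.Wtime()
    with open(path, "wb") as f:
        for _ in range(BLOCK_SIZE // XFER_SIZE):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())         # make sure the data actually reached the file system
    comm.Barrier()                   # the slowest writer defines the aggregate rate
    elapsed = MPI.Wtime() - t0

    if rank == 0:
        total_gib = nprocs * BLOCK_SIZE / 1024**3
        print(f"aggregate write: {total_gib / elapsed:.1f} GiB/s across {nprocs} processes")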

Application-level performance is usually measured using tools like h5bench.
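
By contrast, an application-level test pushes data through the same middleware stack a real code would use. The sketch below times a simple serial HDF5 checkpoint write with h5py; the array shape and file name are arbitrary, and h5bench itself covers parallel HDF5, varied access patterns, and asynchronous I/O that this toy example does not.

    # Toy application-level "checkpoint" timing through the HDF5 stack (serial h5py).
    # The array shape and file name are arbitrary stand-ins.
    import time
    import numpy as np
    import h5py

    state = np.random.default_rng().random((1024, 1024, 64))   # ~512 MiB of float64

    t0 = time.perf_counter()
    with h5py.File("checkpoint.h5", "w") as f:
        f.create_dataset("state", data=state)
    elapsed = time.perf_counter() - t0

    mib = state.nbytes / 1024**2
    print(f"wrote {mib:.0f} MiB in {elapsed:.2f} s ({mib / elapsed:.0f} MiB/s)")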

Most public-sector RFPs focus on system-level benchmarking, because the systems being procured have to support very diverse workloads that change over time. The presumption is often that app readiness/modernization efforts will push all the key applications towards the system-level best-case performance by adopting best practices and I/O middleware.

IO500 is not a great benchmark to put into RFPs. I began documenting why I felt this way here, but in brief, it mixes system-level tests (ior-easy and mdtest-easy) and application-level tests (ior-hard) together in a way that is easy to game.

I’m most familiar with the US DOE way of specifying storage benchmarks, and there are a few good examples:

  • NERSC-10 specified a capacity (120 PB) and a best-case bandwidth (20 TB/s), then said “tell us how much read+write bandwidth+IOPS we would get out of this.” They did not dictate how that performance was to be measured, giving bidders a lot of flexibility to define how they think about performance.
  • OLCF-6 specified a capacity (90x the HBM capacity), a best-case bandwidth (write 15% of the capacity in 1 minute), and a best-case small-file create rate (50K 32KiB files/sec). They also specified a few key workload patterns (single-client, application checkpoint, hero random) and said “tell us how much read+write bandwidth+IOPS we would get for each.”
  • ATS-5 specified a capacity (12.6x the system memory), a best-case bandwidth (25% of memory in 300 seconds), and a second best-case bandwidth tied to system reliability (the job mean time to interrupt divided by the time to write 80% of memory must be greater than 200; see the arithmetic sketched after this list). They then said “tell us how much read+write bandwidth+IOPS we would get out of this.”
  • ALCF-4 specified a capacity (60x the HBM capacity) and had a mix of “tell us how much performance we will get” for a few specific patterns and a few clear performance targets:
    • write 25% of the HBM using 2 MiB transfers from 90% of the compute nodes
    • best-case read/write bandwidth for a range of I/O sizes from 32K to 16M
    • random read/write bandwidth for a range of I/O sizes from 32K to 16M
    • create 100K 4 KiB files per second
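
These capacity-and-time specifications translate directly into bandwidth targets with back-of-the-envelope arithmetic. The system sizes and MTTI below are made-up placeholders rather than numbers from any of these RFPs; the point is only to show how the specifications turn into concrete figures.

    # Back-of-the-envelope arithmetic for "write X% of memory in T seconds" specs.
    # Every system size here is a hypothetical placeholder, not an RFP value.
    TB = 1e12

    # ATS-5-style bandwidth floor: 25% of system memory in 300 seconds
    system_memory = 5_000 * TB             # pretend the system has 5 PB of memory
    bw_floor = 0.25 * system_memory / 300
    print(f"required write bandwidth: {bw_floor / TB:.1f} TB/s")

    # ATS-5-style reliability criterion: MTTI / (time to write 80% of memory) > 200
    mtti = 72 * 3600                       # pretend the job mean time to interrupt is 72 hours
    checkpoint_time = 0.80 * system_memory / bw_floor
    print(f"MTTI / checkpoint time = {mtti / checkpoint_time:.0f} (must exceed 200)")

    # OLCF-6-style bandwidth floor: write 15% of the capacity in 1 minute
    capacity = 90 * 100 * TB               # 90x a pretend 100 TB of HBM
    print(f"required write bandwidth: {0.15 * capacity / 60 / TB:.1f} TB/s")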

None of these RFPs specified which benchmarking tool was to be used, which gave bidders more freedom and flexibility to demonstrate performance without over-indexing on the quirks of one tool over another. They trusted that vendors would do the right thing in their choice of specific tools and patterns, because dishonesty or incompetence is easy to sniff out during response evaluations. Letting vendors shoot themselves in the foot early on can save a lot of time.

Focusing on the concerns of the center, rather than on the specific implementation used to measure how those concerns manifest, also allows bidders to show how smart/clever/creative they are. Just as flexibility lets bidders shoot themselves in the foot, it also lets them showcase how good a partner they may be by demonstrating a deep understanding of what the center is trying to accomplish.

Compute

System-level capability is measured using tools like HPL and STREAM.
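
As a rough illustration of what a memory-bandwidth number like STREAM’s represents, the snippet below times a STREAM-triad-style kernel with NumPy on a single node. It is only a stand-in: the real STREAM benchmark is carefully written C/Fortran that controls array sizes, threading, and compiler behavior in ways NumPy does not.

    # Toy STREAM-triad-style memory bandwidth probe: a = b + scalar * c.
    # The array length is an arbitrary choice; STREAM requires arrays far larger than cache.
    import time
    import numpy as np

    n = 100_000_000                   # three ~800 MB float64 arrays
    a = np.zeros(n)
    b = np.random.random(n)
    c = np.random.random(n)
    scalar = 3.0

    best = float("inf")
    for _ in range(5):                # report the best of several trials, as STREAM does
        t0 = time.perf_counter()
        a[:] = b + scalar * c         # the classic triad kernel
        best = min(best, time.perf_counter() - t0)

    # Count only the useful traffic (read b, read c, write a); NumPy's temporaries move
    # extra data, so this understates what the memory system actually delivered.
    moved_bytes = 3 * n * 8
    print(f"triad bandwidth: {moved_bytes / best / 1e9:.1f} GB/s")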

Application-level performance is measured using mini-apps and application benchmark suites. NAS Parallel Benchmarks are one example, but every procurement typically has its own set.

Network

System-level capability is measured using tools like OSU Micro-Benchmarks.
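
For example, the point-to-point latency test that osu_latency implements boils down to the ping-pong pattern sketched below with mpi4py. The message size and iteration count are arbitrary, and the real suite sweeps message sizes and handles warm-up, process placement, and timing much more carefully.

    # Toy two-rank ping-pong latency test in the spirit of osu_latency.
    # Run with something like: mpirun -np 2 python pingpong.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    assert comm.Get_size() == 2, "ping-pong needs exactly two ranks"

    msg = bytearray(8)                # 8-byte message; real suites sweep sizes
    iters = 10_000

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(msg, dest=1)
            comm.Recv(msg, source=1)
        else:
            comm.Recv(msg, source=0)
            comm.Send(msg, dest=0)
    elapsed = MPI.Wtime() - t0

    if rank == 0:
        # one-way latency is half of the average round-trip time
        print(f"latency: {elapsed / iters / 2 * 1e6:.2f} microseconds")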

GPCnet measures performance under contention. It is still a system-level benchmark, but it probes how that system-level capability degrades under realistic, congested conditions.

I can’t think of any standalone application-level network benchmarks off the top of my head; network performance is usually exercised as part of application-level compute benchmarks.