MLPerf Storage, IO500, and every other storage benchmarking effort are driven by storage practitioners trying to create benchmarks for other storage practitioners that are meant to be relevant to end users of storage. However, no end users are involved, so the resulting benchmarks wind up out of touch with, and behind, what workloads really need. For example,

  • IO500’s ior-hard test writes data 47008 bytes at a time. No real workload does this, but by awarding high scores to storage systems that handle it well, IO500 sends the message that log-structured file systems (which handle these unaligned patterns well) are superior to everything else. This is why DAOS (and experimental log-structured file systems) always dominate the top of IO500. However, handling 47008-byte I/Os well has little correlation with a good user experience when running HPC/AI workloads.
  • Similarly, MLPerf Storage tests arbitrary I/O patterns; for example, UNet3D reads ~128 KB at a time from hundreds or thousands of ~100 MB files. This implies that real AI workloads do this, but they do not. Real AI workloads bundle tokens into whatever chunk sizes load fastest. Furthermore, most training workloads prefetch input tokens to local SSDs, so even if a training run did read 128 KB at a time, it wouldn’t be from shared storage. It would be from an XFS file system on a local SSD, and the shared storage would see read traffic optimized for whatever I/O size gave the best bandwidth.
  • Even testing peak bandwidth is fraught, because these tests tend to write a bunch of zeros, then immediately read them back. Real workloads don’t do this; they write data, spend time computing, then read it back much later. Sophisticated storage systems (like VAST) use this delay to repack data, compress out zeroes, and garbage collect. As a result, simple benchmarks (write-then-read) don’t reflect the performance optimizations that real workloads (write-compute-read) achieve.
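To make the first point concrete, here is a minimal sketch of an ior-hard-style write loop. The file path, total size, and the 4 KiB comparison are illustrative assumptions, not part of any benchmark's actual code; the point is simply that 47008 is not a multiple of any common block or page size, so every write straddles an alignment boundary.

```python
import os

ODD_SIZE = 47008          # ior-hard's unaligned transfer size
ALIGNED_SIZE = 1 << 20    # 1 MiB, a typical bandwidth-friendly size

def write_pattern(path, xfer_size, total_bytes):
    """Write total_bytes to path in fixed-size chunks of xfer_size."""
    buf = bytes(xfer_size)  # zero-filled, as many benchmarks' payloads are
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            f.write(buf)
            written += xfer_size
    return written

# 47008 is not a multiple of a 4 KiB page (or a 512-byte sector),
# so successive writes land at arbitrary offsets within blocks.
print(ODD_SIZE % 4096)  # → 1952, i.e. not page-aligned
```

A storage system only looks good on this pattern if it can absorb arbitrarily misaligned writes, which is exactly the strength of log-structured designs and largely irrelevant to everything else.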

The issue is that the way AI interacts with storage is relatively arbitrary. Like HPC practitioners, leading AI practitioners shape I/Os to match whatever offers the best performance. For example, a benchmark might read data as multiple whole files because that’s what an AI framework like PyTorch does. However, if that pattern performed poorly during real model training, the framework would be changed to perform its reads in whatever pattern offered the best performance.

Storage benchmarks work in enterprise because enterprise applications typically do not get tailored to optimize for storage; the opposite happens, and storage vendors optimize their platforms to provide the best performance for enterprise applications. This is not true in HPC and AI. AI people know this, but storage people (generally) do not.

The reason benchmarks like MLPerf Storage are popular is because their goal isn’t to actually make AI workloads faster; it’s to allow infrastructure people to make infrastructure decisions without learning anything about the workloads that will run on them. This sounds cynical (and it is), but it’s not realistic for everyone making storage decisions to also be experts in HPC/AI.

So what’s better?

I advocate for outcome-based performance tests. In the AI world, this may mean running an end-to-end inferencing workload on a system with real GPUs; although costly, it represents what end users, not their storage admins, experience.

This often leads to much more impactful improvements than what any synthetic benchmark suite can show you. For example, end-to-end testing with a major inferencing stack revealed that its KV block manager was issuing tiny 4K writes because that was the quantum of data movement in GPU memory.

Needless to say, issuing a bunch of 4K writes resulted in poor performance. A synthetic benchmark would’ve either falsely claimed that 4K write performance is essential to KV caching (it is not), or would’ve shown a great performance number that real-world inferencing would never achieve.

Testing the performance through the desired outcome (fast KV cache offloading) revealed a problem, and it was straightforward to issue a patch to the inferencing framework that coalesced those 4K writes into megabyte I/Os that achieved much better overall bandwidth and therefore better inferencing performance. This was an I/O problem, yet synthetic benchmarks could neither detect it nor make users’ lived experience better.
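The coalescing idea behind that patch can be sketched as follows. This is a hypothetical illustration, not the actual fix to any framework: the class name, threshold, and block size are assumptions, with 4 KiB blocks accumulated until a 1 MiB buffer fills and is written in a single call.

```python
import io

class CoalescingWriter:
    """Buffer small writes and flush them as large sequential I/Os.

    Hypothetical sketch: 4 KiB blocks coming from a KV-cache offload
    path accumulate until a megabyte-sized buffer fills, then go out
    as one large write instead of 256 small ones.
    """
    def __init__(self, fileobj, flush_threshold=1 << 20):
        self.fileobj = fileobj
        self.flush_threshold = flush_threshold
        self.buf = bytearray()
        self.flushes = 0   # count of large writes actually issued

    def write(self, block: bytes):
        self.buf += block
        if len(self.buf) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.buf:
            self.fileobj.write(bytes(self.buf))
            self.buf.clear()
            self.flushes += 1

# 512 blocks of 4 KiB (2 MiB total) become just two 1 MiB writes.
sink = io.BytesIO()
w = CoalescingWriter(sink)
for _ in range(512):
    w.write(bytes(4096))
w.flush()
print(w.flushes)  # → 2
```

The storage system sees two megabyte-sized writes instead of 512 tiny ones, which is the kind of change only end-to-end testing would motivate, since a synthetic 4K-write benchmark measures the wrong thing either way.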