HPL-MxP is a benchmark that is, on its surface, similar to HPL but uses reduced-precision (FP16) arithmetic units plus iterative refinement to produce a “FLOPS” measurement. It reports performance as the same FP64 FLOP count classical HPL uses ($\tfrac{2}{3}N^3 + \tfrac{3}{2}N^2$) divided by walltime, even though it doesn’t actually perform that many FP64 operations. Thus, HPL-MxP answers the question, “how many FP64 FLOPS would have been required to solve this problem in the same amount of time?”
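As a back-of-envelope sketch (in Python, using made-up numbers rather than any real submission), the reported score is just that nominal FP64 operation count divided by the measured walltime:

```python
def hpl_mxp_score(n: int, walltime_s: float) -> float:
    """Reported FLOP/s: classical HPL's FP64 operation count over walltime."""
    fp64_ops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return fp64_ops / walltime_s

# Hypothetical run: N = 10,000,000 finishing in 600 seconds
print(f"{hpl_mxp_score(10_000_000, 600.0):.3e} FLOP/s")  # ~1.1e18, i.e. ~1.1 EFLOP/s
```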
Unlike HPL, HPL-MxP does not perform partial pivoting; it generates matrices that are well-conditioned and therefore easier to solve with low precision and GMRES-based refinement than the matrices HPL uses. The lack of pivoting also reduces the amount of MPI communication the benchmark requires. As a result, HPL-MxP is a problematic benchmark: it is heavily lopsided toward rewarding ultra-low-precision throughput above all else.
At its core, the benchmark has two phases:[^1]
| Phase | Purpose | Scaling with $N$ |
|---|---|---|
| Factorization | Makes a low-precision estimate of the answer: complete a full, low-precision LU factorization of the entire matrix to get an approximate solution. | $O(N^3)$ |
| Refinement | Cleans up the low-precision answer to be high-precision: use the L and U factors as a preconditioner in GMRES to iteratively correct the approximate solution to FP64 accuracy. Performed in FP64. | $O(kN^2)$, where $k$ is the number of refinement iterations |
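To make the two phases concrete, here is a toy sketch, not the HPL-MxP reference implementation: FP32 stands in for the low-precision factorization, and plain residual-correction refinement stands in for LU-preconditioned GMRES. SciPy's LU also pivots, unlike HPL-MxP, but the overall structure is the same.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned, diagonally dominant
b = rng.standard_normal(n)

# Phase 1: low-precision LU factorization and a first approximate solution
lu, piv = lu_factor(A.astype(np.float32))
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

# Phase 2: refine in FP64, reusing the low-precision factors to solve for corrections
for _ in range(10):
    r = b - A @ x                                    # FP64 residual
    if np.linalg.norm(r) / np.linalg.norm(b) < 1e-12:
        break
    x += lu_solve((lu, piv), r.astype(np.float32))   # correction via low-precision factors

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # FP64-level relative residual
```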
For very large runs (large $N$), the factorization phase ($O(N^3)$) dominates walltime, and the refinement ($O(kN^2)$, where $k$ is the number of refinement iterations) becomes vanishingly small. As a result, at scale, HPL-MxP effectively…
- Spends all of its time computing a low-precision, approximate answer
- Spends a tiny amount of time cleaning up that answer so it is correct
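To put rough numbers on this imbalance, here is an illustrative calculation; the iteration count $k = 50$ and the $\sim N^2$ per-iteration cost are assumptions, not measured values:

```python
# Fraction of total operations spent on refinement, assuming k = 50 iterations
# that each cost on the order of N^2, versus a (2/3)N^3 factorization.
k = 50
for n in (10_000, 1_000_000, 10_000_000):
    factorization = (2.0 / 3.0) * n**3
    refinement = k * n**2
    print(f"N = {n:>10,}: refinement is {refinement / (factorization + refinement):.1e} of the work")
```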
Because factorization takes up most of the benchmarking time, HPL-MxP is essentially a measurement of tensor core throughput at whatever precision the implementer chooses.
The result is that halving the precision of the factorization’s GEMM (e.g., from FP16 to FP8, which doubles tensor core throughput) roughly doubles the reported score while solving the same mathematical problem to the same accuracy; the toy model after the list below illustrates this. HPL-MxP can’t differentiate between:
- a machine that is fast at FP64-class work, and
- a machine that simply has a large ratio of low-precision to high-precision throughput
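In the model below, all rates and the iteration count are made-up placeholders; it only shows that, with the factorization dominating, the reported score tracks the low-precision GEMM rate almost linearly, so doubling that rate roughly doubles the score.

```python
def reported_flops(n: int, low_prec_rate: float, fp64_rate: float, k: int = 50) -> float:
    """Modeled HPL-MxP score: HPL's FP64 op count over modeled walltime."""
    factorization_time = (2.0 / 3.0) * n**3 / low_prec_rate  # low-precision LU
    refinement_time = 2.0 * k * n**2 / fp64_rate             # FP64 refinement
    hpl_fp64_ops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return hpl_fp64_ops / (factorization_time + refinement_time)

n = 10_000_000
fp64_rate = 5e16                    # assumed FP64 rate: 50 PFLOP/s
for low_prec_rate in (1e18, 2e18):  # e.g. an FP16 machine vs. one with a 2x-faster FP8 GEMM rate
    print(f"{reported_flops(n, low_prec_rate, fp64_rate):.2e} FLOP/s")
```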
When FP8 submissions began appearing around SC25, it became clear that allowing anything below FP16 broke the benchmark rankings, and the rules were changed to require FP16 or higher.[^2]
[^1]: [2509.19618v1] HPL-MxP Benchmark: Mixed-Precision Algorithms, Iterative Refinement, and Scalable Data Generation

[^2]: “The lowest allowed precision is 16-bit floating-point (available in the recent IEEE 754 standard).” Rules