We can’t talk about the differences between hyperscale and exascale without first coming to a common understanding of what we mean by “hyperscale” and “exascale.”
I will assume that you already know what “exascale” means, since the HPC community has been talking about it for the last 15-20 years. That effort culminated in a series of large supercomputers that have been deployed and operated by the governments of the US, Japan, and now some European nations. Though large by historic standards, they look more or less like the systems that preceded them—purpose-built, dense, liquid-cooled systems manufactured by Cray, Fujitsu, or Eviden and built for a broad range of scientific applications. Physically, they look like this:
When we talk about hyperscale AI supercomputers, though, they look quite different. Foremost, they’re really big, and it’s easier to talk about their scale using satellite imagery than photos from inside their datacenters. So here’s a photo of the same two supercomputers from above, both drawn to scale:
And, also drawn to scale, a typical hyperscale datacenter. This example is a standard Microsoft datacenter that houses a very large A100-based supercomputer. Carving out a capability on par with Perlmutter in this facility would take up this much space:
The A100 was many hyperscalers’ entry point into AI supercomputing. The supercomputer housed in this building filled the entire facility, and it was the system on which the model that became ChatGPT was trained.
These LLMs were developed following a predictable pattern: the only way to produce a significantly better model was to build a transformer with significantly more parameters and then train it on significantly more data.
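To put rough numbers on that pattern, a commonly used approximation estimates transformer training compute as about 6 × parameters × tokens FLOPs. Here’s a small sketch using that rule of thumb; the model sizes, token counts, and the sustained 1 PFLOP/s figure are hypothetical, chosen only to show how fast the compute bill grows:

```python
# Rough illustration of why "bigger model + more data" forces bigger supercomputers.
# Uses the common approximation: training compute ~ 6 * N_params * N_tokens FLOPs.
# The model sizes and token counts below are made-up examples, not any vendor's actual runs.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6 * n_params * n_tokens

# (name, parameters, training tokens) for three hypothetical model generations
generations = [
    ("gen 1", 1e9,  2e10),   #   1 B params,  20 B tokens
    ("gen 2", 1e10, 2e11),   #  10 B params, 200 B tokens
    ("gen 3", 1e11, 2e12),   # 100 B params,   2 T tokens
]

for name, n_params, n_tokens in generations:
    flops = training_flops(n_params, n_tokens)
    # Assume a sustained 1 PFLOP/s (1e15 FLOP/s) of training throughput, purely for scale
    days = flops / 1e15 / 86400
    print(f"{name}: ~{flops:.1e} FLOPs, ~{days:,.0f} days at a sustained 1 PFLOP/s")
```

Each 10× increase in parameters, paired with a proportional increase in data, multiplies the compute requirement by roughly 100×, which is why each new model generation effectively demanded a new, larger supercomputer.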
In practice, this meant that the only way to maintain the pace of AI model development was to build supercomputers with ever-increasing size and with the absolute latest and greatest GPUs. So, as the HPC community was transitioning from pre-exascale systems like Perlmutter to exascale systems like Frontier, so too were the hyperscalers.
The first hyperscale-exascale supercomputers were built using NVIDIA H100 GPUs, and Microsoft listed an HPL run that placed #3 on the Top500 list as a demonstration of its capability. That run achieved XYZ% of Frontier’s latest HPL result, and the GPUs that produced it filled out this much of the hyperscale datacenter.
To match the FP64 FLOPS of Frontier’s HPL run, the GPUs required would’ve taken up this much space.
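As a rough back-of-the-envelope sketch of what that comparison implies (the HPL efficiency below is an assumption, and the peak and HPL figures are approximate datasheet and Top500 values, not measurements of any specific system):

```python
# Back-of-the-envelope: how many H100s would it take to match Frontier's HPL number?
# Peak figures are approximate public datasheet values; the HPL efficiency is an assumed round number.

frontier_hpl_flops = 1.2e18      # ~1.2 EFLOP/s, roughly Frontier's measured HPL
h100_fp64_peak     = 67e12       # ~67 TFLOP/s FP64 (tensor core) peak per H100 SXM
assumed_hpl_eff    = 0.7         # assumed fraction of peak sustained in HPL (illustrative)

gpus_needed = frontier_hpl_flops / (h100_fp64_peak * assumed_hpl_eff)
print(f"~{gpus_needed:,.0f} H100s to match Frontier's HPL")   # on the order of 25,000 GPUs
```

Even with generous assumptions, that works out to tens of thousands of GPUs, which is a useful mental yardstick for the space comparison above.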
But what distinguishes these hyperscale AI systems from traditional exascale systems is just that—scale—because these datacenters are not limited by the same space and power constraints that this community is. Here is the latest photo I could find of this hyperscale complex:
And given that each one of these buildings supports somewhere between 30 and 50 MW, and hyperscalers can and do fill out campuses like this, you can imagine the size of some of the largest AI supercomputers being deployed at the hyperscale.
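Some very rough arithmetic makes the point; the per-node power, PUE, and campus size below are assumptions for illustration, not figures for any particular facility:

```python
# Rough sizing: how many GPUs fit in one 30-50 MW building, and in a multi-building campus?
# All per-node power, PUE, and campus-size numbers are assumptions for illustration only.

building_power_mw = 50        # upper end of the 30-50 MW range cited above
node_power_kw     = 10.0      # assumed draw of one 8-GPU H100 node (GPUs + host + fabric)
gpus_per_node     = 8
assumed_pue       = 1.2       # assumed facility overhead (cooling, power conversion)

nodes_per_building = building_power_mw * 1e3 / (node_power_kw * assumed_pue)
gpus_per_building  = nodes_per_building * gpus_per_node
print(f"~{gpus_per_building:,.0f} GPUs per building")           # roughly 33,000 GPUs

campus_buildings = 4          # hypothetical campus size
print(f"~{gpus_per_building * campus_buildings:,.0f} GPUs per campus")
```

Under those assumptions, a single building holds tens of thousands of GPUs, and a full campus multiplies that several times over.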
Cynics would be quick to point out that these hyperscale systems are very space-inefficient, and from an FP64 FLOPS standpoint, that’s true. But remember that these systems aren’t designed to run HPL; the generation of models trained on these H100 GPUs used 16-bit precision. If we compare the BF16 capability of Frontier and Perlmutter to this hyperscale facility, we can see that the low-precision performance density of these hyperscale facilities isn’t so bad.
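For reference, here are the approximate per-GPU peak numbers behind that comparison (dense math, no sparsity; treat them as ballpark datasheet values):

```python
# Approximate per-GPU peak throughput (dense, no sparsity), in TFLOP/s.
# These ballpark datasheet values are the building blocks behind the FP64-vs-BF16 density comparison.

gpus = {
    #                      (FP64 tensor, BF16 tensor)
    "A100 (Perlmutter)":   (19.5,  312.0),
    "MI250X (Frontier)":   (95.7,  383.0),
    "H100 (hyperscale)":   (67.0,  989.0),
}

for name, (fp64, bf16) in gpus.items():
    print(f"{name}: FP64 {fp64:6.1f} TF/s, BF16 {bf16:6.1f} TF/s, "
          f"BF16/FP64 ratio ~{bf16 / fp64:.0f}x")
```

Per GPU, the H100 gives up some FP64 relative to the MI250X but delivers roughly 2.5× its BF16 throughput, which is why the hyperscale facility’s low-precision density looks far better than its FP64 density suggests.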
Ultimately though, these supercomputers were built into what I’ll call first-generation hyperscale AI facilities. These datacenters evolved from the standard build used for cloud infrastructure, not supercomputers, so they