This page is where I’m cataloging differences between training and inferencing. This is sort of related to HPC vs AI.

Power

From Characterizing Power Management Opportunities for LLMs in the Cloud by Esha Choukse, Brijesh Warrier, and team:

Quote

LLM training is thus run on large physical clusters with high-bandwidth Infiniband or optical networks for fast communication. For example, OpenAI scaled up clusters to 7500 GPU servers to train LLMs like GPT-3

Quote

inference only performs forward passes through the model, operates on one or a few data samples per request, and consequently requires fewer compute resources and interconnects.

Quote

Across all inference models, during every iteration, the power usage patterns exhibit two distinct phases: power spikes in the beginning, and a stable, lower power consumption later. Power spikes consistently occur at the start of every inference request, often going beyond GPU TDP. These spikes correspond to the compute-intensive prompt phases of LLMs, which processes all input tokens in parallel. Following the spike, the stable, lower power consumption phase corresponds to the sequential, auto-regressive token sampling.

Put simply, prefill is compute-bound and causes a sharp spike in power consumption. Once it completes, decode is memory-bandwidth-bound and draws less power over a longer stretch.
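
A rough roofline-style estimate shows why. The sketch below uses made-up numbers (a hypothetical 70B-parameter dense model in FP16, a 2048-token prompt, A100-class peak compute and bandwidth); none of it comes from the paper. The point is just that prefill performs thousands of FLOPs per byte of weights read, while decode performs roughly one, which lands it well below the GPU's compute/bandwidth ridge point.

```python
# Back-of-the-envelope arithmetic intensity for prefill vs. decode.
# Every number here is an illustrative assumption, not from the paper.

PARAMS = 70e9                 # hypothetical dense model size (parameters)
BYTES_PER_PARAM = 2           # FP16 weights
PROMPT_TOKENS = 2048          # hypothetical prompt length

# Rough GPU "ridge point": peak FLOP/s divided by memory bandwidth.
# Using A100-class numbers: ~312 TFLOP/s FP16, ~2 TB/s HBM.
RIDGE = 312e12 / 2e12         # ~156 FLOPs per byte

weight_bytes = PARAMS * BYTES_PER_PARAM
flops_per_token = 2 * PARAMS  # ~2 FLOPs per parameter per token

# Prefill: all prompt tokens share one pass over the weights.
prefill_intensity = flops_per_token * PROMPT_TOKENS / weight_bytes
# Decode: every generated token re-reads all the weights.
decode_intensity = flops_per_token / weight_bytes

for name, intensity in [("prefill", prefill_intensity), ("decode", decode_intensity)]:
    bound = "compute" if intensity > RIDGE else "memory-bandwidth"
    print(f"{name}: ~{intensity:.0f} FLOPs/byte -> {bound}-bound")
```

Keeping the arithmetic units busy is what pushes power toward (or past) TDP during prefill; waiting on HBM during decode is what keeps power lower.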

Quote

(1) training has higher peak and average power draw compared to inference, (2) training incurs large swings in power consumption within short durations, up to 37.5% of the provisioned power capacity within 2 seconds, whereas inference only incurs a change of up to 9%, and (3) inference power consumption shows a diurnal pattern since it is an interactive workload; yet, over the course of a few seconds, its power usage remains relatively stable compared to training.

So for a cluster with 50 MW of provisioned power, that’s the equivalent of an 18.75 MW swing within two seconds during training, but only a 4.5 MW swing for inferencing.
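
The arithmetic behind those figures, assuming the paper’s percentages apply to 50 MW of provisioned capacity:

```python
provisioned_mw = 50                           # assumed provisioned power capacity
training_swing_mw = 0.375 * provisioned_mw    # up to 37.5% within 2 seconds
inference_swing_mw = 0.09 * provisioned_mw    # up to 9%
print(f"training:  {training_swing_mw:.2f} MW swing within 2 s")
print(f"inference: {inference_swing_mw:.2f} MW swing")
```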

Quote

[While] power consumption does spike across the GPUs serving the same inference during the prompt phase (Insights 4 and 5), these spikes are not correlated across endpoints serving other inferences. This lack of correlation is due to the variation in arrival times and scheduling at cluster scale.

Inferencing at scale smears the prefill power spikes of many uncorrelated requests across the lower, steadier power draw of decode, so the aggregate power consumption comes out smooth.
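
Here’s a minimal simulation sketch of that smoothing effect, using made-up numbers (per-request power profile, request rate, window length), none of which come from the paper: each request spikes during a short prefill phase and then settles into a longer, lower-power decode phase, and requests arrive at uncorrelated random times.

```python
import random

# Toy illustration of cluster-level power smoothing. All numbers are
# assumptions: each request spikes to 1.3x a nominal per-GPU power during
# prefill, then settles to 0.6x during a longer decode phase.
random.seed(0)

NOMINAL_W = 700.0                      # assumed per-GPU power scale, watts
PREFILL_S, PREFILL_W = 0.5, 1.3 * NOMINAL_W
DECODE_S,  DECODE_W  = 5.0, 0.6 * NOMINAL_W
N_REQUESTS = 20_000                    # arrivals spread over the window
WINDOW_S   = 120.0
DT         = 0.1                       # simulation step, seconds

steps = int(WINDOW_S / DT)
cluster_w = [0.0] * steps              # aggregate power per time step

for _ in range(N_REQUESTS):
    # Uncorrelated arrival times are what smear the prefill spikes apart.
    t0 = random.uniform(0.0, WINDOW_S - PREFILL_S - DECODE_S)
    t1 = t0 + PREFILL_S
    t2 = t1 + DECODE_S
    for i in range(int(t0 / DT), int(t1 / DT)):
        cluster_w[i] += PREFILL_W
    for i in range(int(t1 / DT), int(t2 / DT)):
        cluster_w[i] += DECODE_W

# Look at the middle of the window to ignore ramp-up/ramp-down edges.
mid = cluster_w[steps // 4 : 3 * steps // 4]
print(f"per-request swing:   {PREFILL_W / DECODE_W:.2f}x between prefill and decode")
print(f"cluster-level swing: {max(mid) / min(mid):.2f}x "
      f"around {sum(mid) / len(mid) / 1e3:.0f} kW")
```

With enough concurrent, independently scheduled requests, the aggregate swing comes out far smaller than any single request’s prefill-to-decode swing, which matches the paper’s observation that the spikes are not correlated across endpoints.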