Unique challenges arise when training LLMs at the scale required for frontier models, and there is no end in sight to training ever-larger models in pursuit of higher-quality results.1

Multi-data center training

See multicluster training.

Silent data corruption

See silent data corruption.

Asynchronous training

Asynchronous training is a nascent technique in which model weights are not continuously synchronized across all workers. This allows parts of the training job to keep making progress even if a single GPU, node, or data-parallel model replica fails. Dylan Patel wrote a good summary of this.2
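
The flavor of the idea can be shown with a toy, local-SGD-style sketch: each data-parallel replica takes many unsynchronized local steps, weights are averaged only occasionally, and a replica that has failed is simply dropped from the average so the rest of the job continues. Everything here (the linear model, numpy, the synchronization interval, the simulated failure) is an illustrative assumption, not how any production system implements asynchronous training.

```python
# Toy sketch of infrequently-synchronized ("local SGD"-style) data-parallel
# training. Each replica runs `sync_every` local steps on its own data shard,
# then weights are averaged across only the surviving replicas.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])          # target weights for the toy problem

def make_shard(n=512):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

def local_step(w, X, y, lr=0.05, batch=32):
    idx = rng.integers(0, len(X), size=batch)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch    # MSE gradient
    return w - lr * grad

n_replicas, sync_every, outer_rounds = 4, 50, 10
shards = [make_shard() for _ in range(n_replicas)]
weights = [np.zeros(2) for _ in range(n_replicas)]
alive = [True] * n_replicas

for r in range(outer_rounds):
    if r == 5:
        alive[3] = False                # simulate one replica failing mid-run
    for i in range(n_replicas):
        if not alive[i]:
            continue
        for _ in range(sync_every):     # many steps with no global sync
            weights[i] = local_step(weights[i], *shards[i])
    # Infrequent synchronization: average only the replicas still alive.
    avg = np.mean([w for w, ok in zip(weights, alive) if ok], axis=0)
    weights = [avg.copy() if ok else w for w, ok in zip(weights, alive)]

print("recovered weights:", np.round(avg, 3), "target:", true_w)
```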

Convergence

Large language models are not trained to convergence. Instead, empirical scaling laws are used to determine the ideal number of training tokens for a model of a given size, and that amount of data is then used to train the model. From the PaLM 2 Technical Report:

Scaling law experiments

Scaling Transformer language models has become a popular way to achieve state-of-the-art performance. Kaplan et al. (2020) studied the relationship between scaling the amount of training data (D) and model size (N), and reached the empirical conclusion that it follows a power law, with N needing to grow faster than D. Hoffmann et al. (2022) built upon this observation with a similar study that tuned smaller models’ hyperparameters better. Their results corroborated Kaplan et al. (2020)’s power law conclusion; however, they arrived at different results regarding the optimal ratios, showing that N and D should instead grow in equal proportions.
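
In practice, the Hoffmann et al. (2022) result is often reduced to a rule of thumb of roughly 20 training tokens per parameter, with training compute approximated as C ≈ 6ND FLOPs. The back-of-the-envelope sketch below applies that heuristic; the 20x ratio and the constant 6 are rough approximations, not the exact procedure any particular lab uses.

```python
# Chinchilla-style back-of-the-envelope sizing: ~20 tokens per parameter at
# compute-optimality (Hoffmann et al., 2022), with training compute commonly
# approximated as C ~= 6 * N * D FLOPs. Both numbers are rules of thumb.
def compute_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

def approx_train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

n = 70e9                              # e.g. a 70B-parameter model
d = compute_optimal_tokens(n)         # ~1.4 trillion tokens
print(f"tokens: {d:.3g}, FLOPs: {approx_train_flops(n, d):.3g}")
```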

No such scaling laws exist for training foundation models for science, which forces that field to do extensive hyperparameter optimization prior to training. This is one of the areas where foundation models for science are far behind the state of the art.

Resilience

GPUs and servers fail constantly when training at scale (see the checkpoint-and-resume sketch after the list below).

  • availability discusses the techniques for minimizing job downtime caused by crashes and subsequent restart time.
  • component reliability contains data about failure rates of different components.
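
Below is a minimal checkpoint-and-resume sketch of the basic mitigation: snapshot training state periodically so that a restarted job loses at most one checkpoint interval of work. The tiny model, file name, and interval are illustrative assumptions; real jobs use sharded, distributed checkpoints and much more elaborate restart machinery.

```python
# Minimal checkpoint-and-resume sketch: periodically persist model, optimizer,
# and step counter so a crashed job can resume instead of starting over.
import os
import torch

CKPT = "checkpoint.pt"                          # hypothetical checkpoint path
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
start_step = 0

# Resume from the latest checkpoint if one exists (e.g. after a crash).
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

ckpt_every, total_steps = 100, 1000
for step in range(start_step, total_steps):
    x = torch.randn(32, 8)
    loss = model(x).pow(2).mean()               # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % ckpt_every == 0:
        # Write to a temp file, then rename atomically so a crash mid-write
        # cannot corrupt the latest checkpoint.
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT + ".tmp")
        os.replace(CKPT + ".tmp", CKPT)
```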

Anecdotes

  • ByteDance’s MegaScale paper3 contains descriptions of the entire infrastructure required to train at 10K+ GPU scale.
  • Meta AI’s OPT-175B logbook4 provides specific details about errors and mitigations that happened while training an LLM across 1K GPUs for two months.

Footnotes

  1. See OpenAI Keynote on Building Scalable AI Infrastructure and the scaling plots cited from the GPT-4 technical report.

  2. Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure

  3. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arXiv:2402.15627)

  4. OPT-175B logbook: facebookresearch/metaseq, projects/OPT/chronicles (github.com)