Unique challenges arise when training LLMs at the scale required for frontier models, and there is no end in sight to training ever-larger models in pursuit of higher-quality results.1

Multi-data center training

See multicluster training.

Silent data corruption

See silent data corruption.

Asynchronous training

Asynchronous training is a nascent technique in which model weights are not continuously synchronized across all workers. This allows parts of the training job to keep making progress even if a single GPU, node, or data-parallel model replica fails. Dylan Patel wrote a good summary of this.2
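
The flavor of the idea can be shown with a toy, local-SGD-style sketch: each data-parallel replica takes many unsynchronized local steps, weights are averaged only occasionally, and a replica that has failed is simply dropped from the average so the rest of the job continues. Everything here (the linear model, numpy, the synchronization interval, the simulated failure) is an illustrative assumption, not how any production system implements asynchronous training.

```python
# Toy sketch of infrequently-synchronized ("local SGD"-style) data-parallel
# training. Each replica runs `sync_every` local steps on its own data shard,
# then weights are averaged across only the surviving replicas.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])          # target weights for the toy problem

def make_shard(n=512):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

def local_step(w, X, y, lr=0.05, batch=32):
    idx = rng.integers(0, len(X), size=batch)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch    # MSE gradient
    return w - lr * grad

n_replicas, sync_every, outer_rounds = 4, 50, 10
shards = [make_shard() for _ in range(n_replicas)]
weights = [np.zeros(2) for _ in range(n_replicas)]
alive = [True] * n_replicas

for r in range(outer_rounds):
    if r == 5:
        alive[3] = False                # simulate one replica failing mid-run
    for i in range(n_replicas):
        if not alive[i]:
            continue
        for _ in range(sync_every):     # many steps with no global sync
            weights[i] = local_step(weights[i], *shards[i])
    # Infrequent synchronization: average only the replicas still alive.
    avg = np.mean([w for w, ok in zip(weights, alive) if ok], axis=0)
    weights = [avg.copy() if ok else w for w, ok in zip(weights, alive)]

print("recovered weights:", np.round(avg, 3), "target:", true_w)
```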

Convergence

Large language models are not trained to convergence. Instead, empirical scaling laws are used to determine the ideal number of training tokens for a model of a given size, and that amount of data is then used to train the model. From the PaLM 2 Technical Report:

Scaling law experiments

Scaling Transformer language models has become a popular way to achieve state-of-the-art performance. Kaplan et al. (2020) studied the relationship between scaling the amount of training data (D) and model size (N), and reached the empirical conclusion that it follows a power law, with N needing to grow faster than D. Hoffmann et al. (2022) built upon this observation with a similar study that tuned smaller models’ hyperparameters better. Their results corroborated Kaplan et al. (2020)’s power law conclusion; however, they arrived at different results regarding the optimal ratios, showing that N and D should instead grow in equal proportions.
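
In practice, the Hoffmann et al. (2022) result is often reduced to a rule of thumb of roughly 20 training tokens per parameter, with training compute approximated as C ≈ 6ND FLOPs. The back-of-the-envelope sketch below applies that heuristic; the 20x ratio and the constant 6 are rough approximations, not the exact procedure any particular lab uses.

```python
# Chinchilla-style back-of-the-envelope sizing: ~20 tokens per parameter at
# compute-optimality (Hoffmann et al., 2022), with training compute commonly
# approximated as C ~= 6 * N * D FLOPs. Both numbers are rules of thumb.
def compute_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

def approx_train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

n = 70e9                              # e.g. a 70B-parameter model
d = compute_optimal_tokens(n)         # ~1.4 trillion tokens
print(f"tokens: {d:.3g}, FLOPs: {approx_train_flops(n, d):.3g}")
```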

No such scaling laws exist for training foundation models for science, which forces that field to do extensive hyperparameter optimization prior to training. This is one of the areas where foundation models for science are far behind the state of the art.

Resilience

GPUs and servers fail constantly when training at scale (see the checkpoint-and-resume sketch after the list below).

  • availability discusses the techniques for minimizing job downtime caused by crashes and subsequent restart time.
  • component reliability contains data about failure rates of different components.
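
Below is a minimal checkpoint-and-resume sketch of the basic mitigation: snapshot training state periodically so that a restarted job loses at most one checkpoint interval of work. The tiny model, file name, and interval are illustrative assumptions; real jobs use sharded, distributed checkpoints and much more elaborate restart machinery.

```python
# Minimal checkpoint-and-resume sketch: periodically persist model, optimizer,
# and step counter so a crashed job can resume instead of starting over.
import os
import torch

CKPT = "checkpoint.pt"                          # hypothetical checkpoint path
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
start_step = 0

# Resume from the latest checkpoint if one exists (e.g. after a crash).
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"] + 1

ckpt_every, total_steps = 100, 1000
for step in range(start_step, total_steps):
    x = torch.randn(32, 8)
    loss = model(x).pow(2).mean()               # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % ckpt_every == 0:
        # Write to a temp file, then rename atomically so a crash mid-write
        # cannot corrupt the latest checkpoint.
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, CKPT + ".tmp")
        os.replace(CKPT + ".tmp", CKPT)
```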

Anecdotes

  • ByteDance’s MegaScale paper3 contains descriptions of the entire infrastructure required to train at 10K+ GPU scale.
  • Meta AI’s OPT-175B logbook4 provides specific details about errors and mitigations that happened while training an LLM across 1K GPUs for two months.

Footnotes

  1. See OpenAI Keynote on Building Scalable AI Infrastructure and the scaling plots cited from the GPT-4 technical report.

  2. Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure

  3. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arXiv:2402.15627)

  4. OPT-175B logbook: facebookresearch/metaseq, projects/OPT/chronicles (github.com)