Llama-3 is a family of models developed by Meta and described in The Llama-3 Herd of Models (arxiv.org). Its largest form, Llama-3.1 405B, may be considered a frontier model.
Llama-3 is a dense transformer that is a bigger and better-trained version of the previous Llama-2 model. Compared to that model:
- They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2)
- They trained more, using 2,048 nodes1 on Meta’s H100 RoCE cluster and cranking through roughly 3.8×10^25 operations total in bfloat16.
Notably, Llama-3 uses Grouped-Query Attention (GQA) instead of standard multi-head attention: groups of query heads share a single key/value head, which shrinks the KV cache and reduces computation requirements and memory footprint.2 They deliberately did not use a mixture-of-experts architecture. They also used a significantly larger (128K) vocabulary size, which allowed them to train Llama-3 as a multilingual model.
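The KV-sharing idea behind GQA can be shown in a few lines. This is a minimal NumPy sketch, not Meta's implementation; the head counts are illustrative (Llama-3 405B's actual query/KV head counts come from the paper, not this demo):

```python
# Minimal sketch of grouped-query attention (GQA).
# Groups of query heads share a single key/value head, so the KV cache
# shrinks by a factor of n_q_heads / n_kv_heads versus multi-head attention.
import numpy as np

def gqa(q, k, v):
    """q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d)."""
    seq, n_q_heads, d = q.shape
    n_kv_heads = k.shape[1]
    group = n_q_heads // n_kv_heads      # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                  # the KV head this query head shares
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out

seq, d = 4, 8
q = np.random.randn(seq, 16, d)  # 16 query heads
k = np.random.randn(seq, 2, d)   # only 2 KV heads -> 8x smaller KV cache
v = np.random.randn(seq, 2, d)
print(gqa(q, k, v).shape)        # (4, 16, 8)
```

With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention; the savings come entirely from storing fewer K/V tensors.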
Oxen.ai has a great summary of the paper.3 In brief, the paper has:
- A good explanation of how they cleaned their training data.
- Great anecdotes about component reliability and job mean time to interruption (JMTTI)
- A description of techniques they used to train long contexts, anneal the model, and other practical things.
Post-training Llama-3 involved supervised fine-tuning, rejection sampling, and DPO.4
Hyperparameters
Pavan Balaji presented the following hyperparameters:1
| GPUs | Tensor Parallelism | Context Parallelism | Pipeline Parallelism | Tokens/batch | TFLOPS/GPU |
|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 16M | 430 |
| 16,384 | 8 | 1 | 16 | 16M | 400 |
| 16,384 | 8 | 16 | 16 | 16M | 380 |
This table is probably in the paper as well.
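The table's columns are internally consistent: total GPUs factor into tensor × context × pipeline × data parallelism, so the data-parallel degree is implied. A quick sanity check (my arithmetic, not a figure from the slide):

```python
# Derive the implied data-parallel degree from the table above:
# total GPUs = TP x CP x PP x DP.
def dp_degree(gpus, tp, cp, pp):
    assert gpus % (tp * cp * pp) == 0, "parallelism dims must divide GPU count"
    return gpus // (tp * cp * pp)

print(dp_degree(8_192, 8, 1, 16))    # 64
print(dp_degree(16_384, 8, 1, 16))   # 128
print(dp_degree(16_384, 8, 16, 16))  # 8  (long-context config)
```

Note that the last row trades data parallelism for context parallelism at the same GPU count, which lines up with the drop in achieved TFLOPS/GPU.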
Llama-3.2
The Llama-3 paper4 also describes how they bolted a vision transformer onto Llama to give it visual reasoning capabilities. That model was 506B parameters (405B base, 0.63B for the vision transformer, and 100B for the cross-attention adapters) and was never released. However, this approach was applied to Llama-3.1 70B and 8B, resulting in Llama-3.2.
Footnotes
- Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24. He showed a slide with hyperparameters which included 16,384 GPUs.