Llama-3 is a family of models developed by Meta and described in The Llama-3 Herd of Models (arxiv.org). Its largest form, Llama-3.1 405B, may be considered a frontier model.
Llama-3 is a dense transformer that is a bigger and better-trained version of the previous Llama-2 model. Compared to that model:
- They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2)
- They trained more, using 2,048 nodes1 on Meta’s H100 RoCE cluster and cranking through roughly 3.8×10^25 operations total in bfloat16.
Notably, Llama-3 uses Grouped-Query Attention (GQA) instead of standard multi-head attention: groups of query heads share a single key/value head, which shrinks the KV cache and reduces computation requirements and memory footprint.2 They deliberately did not use a mixture-of-experts architecture. They also used a significantly larger (128K) vocabulary size, which allowed them to train Llama-3 as a multilingual model.
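The KV-sharing idea behind GQA can be shown in a few lines. This is a minimal NumPy sketch, not Meta's implementation; the head counts are illustrative (Llama-3 405B's actual query/KV head counts come from the paper, not this demo):

```python
# Minimal sketch of grouped-query attention (GQA).
# Groups of query heads share a single key/value head, so the KV cache
# shrinks by a factor of n_q_heads / n_kv_heads versus multi-head attention.
import numpy as np

def gqa(q, k, v):
    """q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d)."""
    seq, n_q_heads, d = q.shape
    n_kv_heads = k.shape[1]
    group = n_q_heads // n_kv_heads      # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                  # the KV head this query head shares
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out

seq, d = 4, 8
q = np.random.randn(seq, 16, d)  # 16 query heads
k = np.random.randn(seq, 2, d)   # only 2 KV heads -> 8x smaller KV cache
v = np.random.randn(seq, 2, d)
print(gqa(q, k, v).shape)        # (4, 16, 8)
```

With `n_kv_heads == n_q_heads` this reduces to ordinary multi-head attention; the savings come entirely from storing fewer K/V tensors.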
Oxen.ai has a great summary of the paper.3 In brief, the paper has:
- A good explanation of how they cleaned their training data.
- Great anecdotes about component reliability and job mean time to interruption (JMTTI)
- A description of techniques they used to train long contexts, anneal the model, and other practical things.
Post-training Llama-3 involved supervised fine-tuning, rejection sampling, and DPO.4
Hyperparameters
Pavan Balaji presented the following hyperparameters:1
| GPUs | Tensor Parallelism | Context Parallelism | Pipeline Parallelism | Tokens/batch | TFLOPS/GPU |
|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 16M | 430 |
| 16,384 | 8 | 1 | 16 | 16M | 400 |
| 16,384 | 8 | 16 | 16 | 16M | 380 |
This table is probably in the paper as well.
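The table's columns are internally consistent: total GPUs factor into tensor × context × pipeline × data parallelism, so the data-parallel degree is implied. A quick sanity check (my arithmetic, not a figure from the slide):

```python
# Derive the implied data-parallel degree from the table above:
# total GPUs = TP x CP x PP x DP.
def dp_degree(gpus, tp, cp, pp):
    assert gpus % (tp * cp * pp) == 0, "parallelism dims must divide GPU count"
    return gpus // (tp * cp * pp)

print(dp_degree(8_192, 8, 1, 16))    # 64
print(dp_degree(16_384, 8, 1, 16))   # 128
print(dp_degree(16_384, 8, 16, 16))  # 8  (long-context config)
```

Note that the last row trades data parallelism for context parallelism at the same GPU count, which lines up with the drop in achieved TFLOPS/GPU.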
Llama-3.2
The Llama-3 paper4 also describes how they bolted a vision transformer onto Llama to give it visual reasoning capabilities. That model was 506B parameters (405B base, 0.63B for the vision transformer, and 100B for the cross-attention adapters) and was never released. However, this approach was applied to Llama-3.1 70B and 8B, resulting in Llama-3.2.
Footnotes
- Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24. He showed a slide with hyperparameters which included 16,384 GPUs.