The hidden dimension of a transformer (also called the model dimension or residual stream dimension) is the hyperparameter that sets how many features each token's vector representation has. The number of weights a transformer has is largely a function of this hyperparameter.
A few examples:
- GPT-3 175B has a hidden dimension of 12,288.[^1]
- Llama 3.1 405B has a hidden dimension of 16,384.
- DeepSeek-V3 (the base model for DeepSeek-R1) has a hidden dimension of 7,168.[^2]
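To see how the hidden dimension drives the weight count, here is a rough sketch of the parameter-count arithmetic for a GPT-3-style dense transformer. It assumes the common 4x MLP expansion and counts only the large weight matrices (ignoring biases, LayerNorm parameters, and positional embeddings), so it is an approximation, not an exact formula for any particular model.

```python
def approx_param_count(d_model: int, n_layers: int, vocab_size: int) -> int:
    """Rough parameter count for a dense GPT-style transformer."""
    # Attention: Q, K, V, and output projections, each d_model x d_model -> 4 d^2
    attn = 4 * d_model**2
    # MLP with the common 4x expansion: two matrices of size d_model x 4*d_model -> 8 d^2
    mlp = 8 * d_model**2
    # Token embedding matrix (often shared with the output head)
    embed = vocab_size * d_model
    return n_layers * (attn + mlp) + embed

# GPT-3 175B: hidden dimension 12,288, 96 layers, ~50k vocabulary
print(approx_param_count(12_288, 96, 50_257))  # → 174563733504, i.e. ~175B
```

Because the per-layer cost scales with the square of the hidden dimension, widening the residual stream is one of the main levers for growing a model's parameter count.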
[^1]: Brown et al., "Language Models Are Few-Shot Learners."
[^2]: Liu et al., "DeepSeek-V3 Technical Report."