MLA (Multi-head Latent Attention) is a way of computing attention that “compresses” and “decompresses” the KV vectors, which reduces the amount of KV cache (and therefore GPU memory) required during decode.
Latent attention factorizes the large KV matrix into two smaller matrices: one (the latent) goes into the KV cache, and the other holds learned weights. Both take less memory than the original, and you rehydrate the full matrix only when needed.
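To see the savings concretely, take some illustrative numbers (not DeepSeek’s actual config): with 8 heads of dimension 128, a standard KV cache stores 2 × 8 × 128 = 2,048 values per token per layer, while caching only a shared 128-dimensional latent cuts that to 128 values per token per layer, a 16× reduction.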
This approach effectively stores a compressed representation (a latent representation) of the K and V tensors that is shared across the heads in each transformer layer. During a forward pass, each attention head rehydrates its real K and V using this latent representation on-demand. This rehydration is more computationally expensive than storing K and V in memory outright, but it saves a ton of memory and allows you to fit a larger model in the same GPU memory footprint.
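Here is a minimal PyTorch sketch of the idea, using the illustrative dimensions from above (all names like `LatentKV`, `w_down`, and `w_up_k` are my own, and real MLA as published also compresses queries and handles a separate RoPE dimension, which this skips):

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of latent KV compression: cache a small latent, rehydrate K/V on demand."""
    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection: hidden state -> shared latent (this is what gets cached)
        self.w_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: latent -> per-head K and V (the "rehydration" weights)
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def compress(self, h):
        # h: [batch, seq, d_model] -> latent: [batch, seq, d_latent]
        return self.w_down(h)

    def rehydrate(self, c):
        # c: [batch, seq, d_latent] -> K, V: [batch, n_heads, seq, d_head]
        b, s, _ = c.shape
        k = self.w_up_k(c).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        return k, v

mla = LatentKV()
h = torch.randn(1, 16, 1024)   # hidden states for 16 tokens
c = mla.compress(h)            # only this 128-dim latent is cached per token
k, v = mla.rehydrate(c)        # full K/V recomputed on demand during attention
print(c.shape, k.shape, v.shape)  # [1, 16, 128], [1, 8, 16, 128], [1, 8, 16, 128]
```

Note the trade-off in the sketch: `rehydrate` runs two extra matrix multiplies every forward pass, exactly the “more computationally expensive” step described above, in exchange for caching only `c` instead of the full `k` and `v`.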
This method was introduced in DeepSeek-V2 and is used by DeepSeek-R1.