MLA (Multi-head Latent Attention) is a way of computing attention that “compresses” and “decompresses” the KV vectors, which reduces the amount of KV cache (and therefore GPU memory) required during decode.
Latent attention factorizes the large KV matrix into two smaller matrices: one (the latent) goes into the KV cache, and the other holds learned weights. Both take less memory than the original, and you rehydrate the full matrix only when needed.
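To see the savings concretely, take some illustrative numbers (not DeepSeek’s actual config): with 8 heads of dimension 128, a standard KV cache stores 2 × 8 × 128 = 2,048 values per token per layer, while caching only a shared 128-dimensional latent cuts that to 128 values per token per layer, a 16× reduction.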
This approach effectively stores a compressed representation (a latent representation) of the K and V tensors that is shared across the heads in each transformer layer. During a forward pass, each attention head rehydrates its real K and V using this latent representation on-demand. This rehydration is more computationally expensive than storing K and V in memory outright, but it saves a ton of memory and allows you to fit a larger model in the same GPU memory footprint.
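Here is a minimal PyTorch sketch of the idea, using the illustrative dimensions from above (all names like `LatentKV`, `w_down`, and `w_up_k` are my own, and real MLA as published also compresses queries and handles a separate RoPE dimension, which this skips):

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of latent KV compression: cache a small latent, rehydrate K/V on demand."""
    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection: hidden state -> shared latent (this is what gets cached)
        self.w_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: latent -> per-head K and V (the "rehydration" weights)
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def compress(self, h):
        # h: [batch, seq, d_model] -> latent: [batch, seq, d_latent]
        return self.w_down(h)

    def rehydrate(self, c):
        # c: [batch, seq, d_latent] -> K, V: [batch, n_heads, seq, d_head]
        b, s, _ = c.shape
        k = self.w_up_k(c).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        return k, v

mla = LatentKV()
h = torch.randn(1, 16, 1024)   # hidden states for 16 tokens
c = mla.compress(h)            # only this 128-dim latent is cached per token
k, v = mla.rehydrate(c)        # full K/V recomputed on demand during attention
print(c.shape, k.shape, v.shape)  # [1, 16, 128], [1, 8, 16, 128], [1, 8, 16, 128]
```

Note the trade-off in the sketch: `rehydrate` runs two extra matrix multiplies every forward pass, exactly the “more computationally expensive” step described above, in exchange for caching only `c` instead of the full `k` and `v`.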
This method was introduced in DeepSeek-V2 and is used by DeepSeek-R1.