The key-value cache is used during LLM inference to exploit the fact that, as previously generated tokens are used to generate new tokens, those older tokens’ key and value vectors do not change.

Every new token generated during decode depends only on the tokens that precede it, not on tokens that have not yet been generated. This means that once a token’s key and value vectors are computed, they never change, so they can be computed once and reused as each subsequent output token is generated.

This repeated reuse of K and V vectors gives rise to the KV cache, which stores the key and value vectors of all previously generated tokens so they are available when generating the next token.
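
Here is a minimal sketch of the idea in NumPy, assuming a single attention head with hypothetical projection matrices `W_q`, `W_k`, and `W_v` (the names and dimensions are illustrative, not from any particular model). Each decode step computes K and V only for the newest token, appends them to the cache, and attends over the cached vectors:

```python
import numpy as np

d_k = 64  # head dimension (illustrative)

# Hypothetical projection weights for a single attention head.
rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_k, d_k))
W_k = rng.standard_normal((d_k, d_k))
W_v = rng.standard_normal((d_k, d_k))

# The KV cache: one K row and one V row per token seen so far.
k_cache = []  # list of (d_k,) arrays
v_cache = []

def decode_step(x_new):
    """Attend the newest token to all previous tokens using cached K/V.

    x_new: (d_k,) hidden state of the token being generated.
    Only this token's K and V are computed here; older tokens' K/V come
    straight from the cache because they never change.
    """
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)  # compute K for the new token once
    v_cache.append(x_new @ W_v)  # compute V for the new token once

    K = np.stack(k_cache)  # (t, d_k)
    V = np.stack(v_cache)  # (t, d_k)

    scores = K @ q / np.sqrt(d_k)             # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over cached tokens
    return weights @ V                        # (d_k,) attention output

# Usage: each decode step appends exactly one K and one V vector.
for step in range(5):
    out = decode_step(rng.standard_normal(d_k))
print(len(k_cache))  # 5 cached key vectors, reused on every step
```

The point of the sketch is the asymptotics: per decode step, only one new K and one new V are computed, while the full attention over the sequence reads the rest from the cache instead of recomputing them.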

I wrote a more detailed explanation of why key and value vectors are cacheable in Full attention.