The key-value (KV) cache is used during LLM inferencing to accelerate the attention part of transformers. It exploits the fact that, as previously generated tokens are used to generate new tokens (autoregressive decoding), old tokens' key and value vectors do not change.
Every new token generated during decode depends only on the tokens that precede it, not the ones that haven't yet been generated. As a result, the key and value vectors of previously generated tokens never change once computed, so K and V vectors for older tokens can be computed once and reused repeatedly as output tokens are generated.
This repeated reuse of K and V vectors gives rise to KV caches, which store the key/value vectors of all previously processed tokens while subsequent tokens are generated.
I wrote a more detailed explanation of why key and value vectors are cacheable in Full attention.
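To make the mechanics concrete, below is a minimal single-head sketch in NumPy. The shapes and random inputs are placeholders, and the learned projections W_q, W_k, and W_v are elided; the point is that each decode step appends one new key/value row to the cache instead of recomputing K and V for the whole sequence.

```python
import numpy as np

def attend(q, K_cache, V_cache):
    # single-head attention: the newest token's query against all cached keys/values
    scores = K_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

rng = np.random.default_rng(0)
d = 64
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
for step in range(8):
    x = rng.standard_normal(d)         # hidden state of the newest token (stand-in)
    q, k, v = x, x, x                  # a real model would apply W_q, W_k, W_v here
    K_cache = np.vstack([K_cache, k])  # old rows are never recomputed, only appended to
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)
```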
Applications
KV caches are very useful for prefix caching, where every query to a chatbot is preceded by the same system prompt.
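As a toy illustration of prefix caching (every name here is made up; real implementations such as vLLM's automatic prefix caching hash fixed-size blocks of token IDs rather than whole prefixes), a prefill for a given prefix is computed once and then served from a lookup table:

```python
import hashlib

prefix_cache: dict[str, object] = {}  # hash of prefix token IDs -> cached K/V tensors

def _prefix_key(token_ids: list[int]) -> str:
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

def get_or_compute_prefix_kv(token_ids, compute_kv):
    key = _prefix_key(token_ids)
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(token_ids)  # run prefill once for this prefix
    return prefix_cache[key]  # later requests with the same prefix skip prefill entirely
```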
More generally, KV caches are useful for long-context inferencing, which is prevalent in
- AI models for science, which operate on huge amounts of scientific data as input (telescope images, etc.).
- Code generation, where entire codebases can be included with a prompt. More generally, this can be extended to RAG with lots of relevant context.
- Multi-turn chat sessions with long wait times between turns.
Finally, I believe there is utility in KV caching for disaggregated inferencing: a fast, global KV cache allows prefill to be desynchronized from decode, since prefill workers can write KV vectors that decode workers consume later.
Implementation
KV caches can be implemented at multiple levels of the memory hierarchy (a toy sketch of tier demotion follows below):
- GPU HBM: This is where KV vectors must reside while they are actively being used to generate tokens.
- CPU DRAM: This is a larger pool to which KV vectors that don't belong to the currently active multi-turn inferencing session can be offloaded. The active session cannot live solely in DRAM because all of a conversation's KV vectors must be in HBM to generate that conversation's next token.
- Local SSD: This is an even slower, even bigger pool where KV vectors can be cached. It has the same utility as the CPU DRAM.
- Remote storage: Same as above.
NVIDIA Dynamo defines these tiers as G1, G2, G3, and G4, and it also attributes a data lifecycle to these tiers.1
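Below is a toy sketch of demotion and promotion between two tiers. This is purely illustrative (the class, capacity, and policy are made up, and it is not how Dynamo implements its tiers): the fast tier evicts its least-recently-used entries into the slow tier instead of dropping them, and hits in the slow tier are promoted back.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a bounded fast tier (think HBM) backed by a slow tier (DRAM/SSD)."""

    def __init__(self, fast_capacity: int):
        self.fast = OrderedDict()  # key -> KV blob, LRU-ordered, bounded
        self.slow = {}             # key -> KV blob, effectively unbounded
        self.fast_capacity = fast_capacity

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)         # refresh recency
            return self.fast[key]
        if key in self.slow:
            self.put(key, self.slow.pop(key))  # promote: KV must be fast-tier-resident to use
            return self.fast[key]
        return None

    def put(self, key, kv):
        self.fast[key] = kv
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_kv = self.fast.popitem(last=False)
            self.slow[old_key] = old_kv        # demote the coldest entry instead of discarding it
```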
Sizing
The size of a single token's key (or value) vectors, summed across the whole model, is the product of the following factors (a worked example appears after the list):1
- Number of layers
- Number of attention heads per layer
- Dimension of the attention head
- Precision (bytes per element) of the key/value
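Doubling this product (keys plus values) and multiplying by the number of cached tokens gives the total cache size. As a worked sketch, using assumed, roughly Llama-3-8B-shaped dimensions with grouped-query attention (these numbers are not from the cited post):

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # 2x because each token stores one key vector and one value vector per layer/head
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# assumed dimensions: 32 layers, 8 KV heads, head dim 128, FP16 (2 bytes per element)
print(kv_cache_bytes(1, 32, 8, 128, 2))             # 131072 bytes = 128 KiB per token
print(kv_cache_bytes(8192, 32, 8, 128, 2) / 2**30)  # 1.0 GiB for an 8K-token context
```

At these dimensions, a 128K-token context would occupy 16 GiB per request, which is why the tiering described above matters.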
Optimizations
Space efficiency
Distributed KV caches can become “perforated,” where parts of a transformer’s cached vectors are missing due to eviction or the failure of a cache node. vLLM is adding support for recomputing the KV vectors for the perforated parts of a cache2 rather than throwing out the entire cache.
ChunkKV3 is a technique that demonstrates how blocks of low-scoring tokens in the context can simply be deleted from attention to reduce the number of FLOPs required to process a prefill. You essentially “delete” pointless tokens from the prefix to reduce both memory and compute requirements.
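A toy sketch of chunk-level pruning follows. The scoring here is a stand-in (a mean of assumed per-token importance values); ChunkKV's actual chunk-selection criterion differs, so treat this as the shape of the idea rather than the paper's algorithm.

```python
import numpy as np

def prune_kv_chunks(K, V, scores, chunk_size=16, keep_frac=0.5):
    # K, V: (t, d) cached keys/values; scores: (t,) per-token importance estimates
    t = K.shape[0]
    n_chunks = (t + chunk_size - 1) // chunk_size
    chunk_scores = np.array(
        [scores[i * chunk_size:(i + 1) * chunk_size].mean() for i in range(n_chunks)])
    keep = np.sort(np.argsort(chunk_scores)[-max(1, int(n_chunks * keep_frac)):])
    idx = np.concatenate(
        [np.arange(i * chunk_size, min((i + 1) * chunk_size, t)) for i in keep])
    return K[idx], V[idx]  # surviving chunks retain their original order

# example: keep the better-scoring half of a 64-token cache, in 16-token chunks
K, V = np.random.randn(64, 8), np.random.randn(64, 8)
K2, V2 = prune_kv_chunks(K, V, scores=np.random.rand(64))  # K2, V2 now have 32 rows
```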
Re-use
See CacheBlend, a seminal work that demonstrates how cached keys and values can be stitched into new prompts.
In addition,
- DroidSpeak4 (Microsoft Research) demonstrates how a KV cache can be reused across fine-tuned variants of the same base model.
- SmartCache5 (Chinese University of Hong Kong/Shenzhen University/UT Arlington) and SemShareKV6 (Notre Dame) propose ways to share KV caches across prompts (or chunks of prompts?) that are semantically similar but not token-wise identical.
These techniques effectively turn KV cache reuse into a semantic search problem; they rewrite prompts so that they match prompts whose keys and values have already been cached. The more aggressively prompts are rewritten to force cache hits, the lower the quality of the inference output.
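None of these papers' algorithms are reproduced here, but the flavor of such a lookup can be sketched as an embedding similarity search with a reuse threshold (the function, names, and threshold value are all assumptions). The threshold is exactly the quality/hit-rate knob described above:

```python
import numpy as np

def semantic_cache_lookup(query_emb, cached_embs, threshold=0.9):
    # cached_embs: (n, d) embeddings of prompts whose KV vectors are already cached
    if len(cached_embs) == 0:
        return None
    sims = cached_embs @ query_emb / (
        np.linalg.norm(cached_embs, axis=1) * np.linalg.norm(query_emb))
    best = int(np.argmax(sims))
    # lowering the threshold raises the hit rate but risks reusing KV vectors from
    # a prompt that only loosely matches, degrading output quality
    return best if sims[best] >= threshold else None
```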
Footnotes
1. https://www.vastdata.com/blog/nvidia-dynamo-vast-scalable-optimized-inference
3. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (arXiv:2502.00299)
4. DroidSpeak: Efficient Context Sharing for Multiple-LLM Inference (Microsoft Research)
5. SmartCache: Context-aware Semantic Cache for Efficient Multi-turn LLM Inference (OpenReview)
6. SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching (arXiv:2509.24832)