The key-value (KV) cache is used during LLM inferencing to accelerate the attention part of transformers. It exploits the fact that, as previously generated tokens are used to generate new tokens (autoregressive decoding), those old tokens’ key and value vectors do not change.

Every new token generated during decode depends only on the tokens that precede it, not on the ones that haven’t been generated yet. This means that previously generated tokens (and their key and value vectors) do not change once they are generated. It also means the K and V vectors for older tokens can be computed once and reused repeatedly as output tokens are generated.

This repeated reuse of K and V vectors gives rise to KV caches, which store the key/value vectors of all previously processed tokens so they can be reused while generating the next tokens.

I wrote a more detailed explanation of why key and value vectors are cacheable in Full attention.
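
As a minimal sketch (plain NumPy, a single attention head, no batching; not tied to any particular framework), the decode loop looks like this: each step computes K and V only for the newest token and appends them to the cache, while the new query attends over everything cached so far.

    import numpy as np

    d = 64                                   # head dimension (illustrative)
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    k_cache, v_cache = [], []                # the KV cache: one K and one V vector per past token

    def decode_step(x):
        """Attention for one newly generated token, reusing all cached K/V."""
        q = x @ Wq
        # K and V for the new token are computed once and appended; entries for
        # older tokens are reused untouched because they never change.
        k_cache.append(x @ Wk)
        v_cache.append(x @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)      # (seq_len, d) each
        scores = K @ q / np.sqrt(d)                      # (seq_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                               # attention output for the new token

    # Without the cache, every step would recompute K and V for the entire prefix.
    for _ in range(5):
        out = decode_step(rng.standard_normal(d))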

Applications

KV caches are very useful for prefix caching, where every query to a chatbot is preceded by the same system prompts.

More generally, KV caches are useful for long-context inferencing, which is prevalent in

  1. AI models for science, which operate on huge amounts of scientific data as input (telescope images, etc.).
  2. Code generation, where entire codebases can be included with a prompt. More generally, this extends to RAG with lots of relevant context.
  3. Multi-turn chat sessions with long wait times between turns.

Finally, I believe there is utility in KV caching for disaggregated inferencing. A fast, global KV cache allows prefill to be desynchronized from decode.

Implementation

KV caches can be implemented at multiple levels of the memory hierarchy:

  • GPU HBM: This is where KV vectors must reside while they are actively being used to generate tokens.
  • CPU DRAM: This is a larger pool where KV vectors that don’t belong to the currently active inferencing session can be parked. The active session cannot be served out of DRAM, because all of a conversation’s KV vectors must be resident in HBM to generate that conversation’s next token.
  • Local SSD: This is an even slower, even bigger pool where KV vectors can be cached. It has the same utility as the CPU DRAM.
  • Remote storage: Same as above.
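
As a conceptual sketch (not any particular framework’s API), these tiers amount to a lookup that walks the pools from fastest to slowest and promotes hits back into HBM, since that is the only tier decode can actually consume. Real systems add asynchronous transfers and smarter eviction policies, but the lookup structure is the same idea.

    class TieredKVCache:
        """Toy multi-tier KV cache: HBM -> DRAM -> SSD -> remote.
        Each tier maps a block hash (e.g., a hash of the token ids a block
        covers) to the serialized K/V tensors for that block."""

        def __init__(self, hbm_capacity_blocks=4):
            self.tiers = {"hbm": {}, "dram": {}, "ssd": {}, "remote": {}}
            self.hbm_capacity = hbm_capacity_blocks

        def put(self, block_hash, kv_block):
            self._make_room_in_hbm()
            self.tiers["hbm"][block_hash] = kv_block

        def get(self, block_hash):
            # Walk tiers fastest-to-slowest; a hit anywhere is promoted back
            # into HBM because that is where decode must read it from.
            for tier in self.tiers.values():
                if block_hash in tier:
                    kv_block = tier.pop(block_hash)
                    self.put(block_hash, kv_block)
                    return kv_block
            return None  # miss: this block has to be recomputed by a prefill

        def _make_room_in_hbm(self):
            # Demote the oldest HBM block one tier down rather than discarding it.
            if len(self.tiers["hbm"]) >= self.hbm_capacity:
                victim, kv = next(iter(self.tiers["hbm"].items()))
                del self.tiers["hbm"][victim]
                self.tiers["dram"][victim] = kv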

NVIDIA

NVIDIA Dynamo defines these tiers as G1, G2, G3, and G4, but it also attributes a data lifecycle to these tiers.1

OpenAI

OpenAI implements two tiers of prefix caching (KV caching):2

  1. In-memory prompt cache, where KV matrices are retained in GPU HBM, typically for 5-10 minutes of inactivity and at most one hour. This seems expensive.
  2. Extended prompt cache, where KV cache is offloaded to node-local NVMe for a maximum of 24 hours.

Cost: This prefix caching capability is automatic and has no price difference for either queries or responses.

Minimum prefix: OpenAI automatically performs KV caching of API queries with prompts of at least 1024 tokens.2

Privacy: Interestingly, OpenAI does share prefix caches across users within an organization (tenant).
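
A small sketch of observing this behavior, assuming the OpenAI Python SDK (the model name and prompt are placeholders): the usage block on a chat completion reports how many prompt tokens were served from the cache, which should become nonzero on a second request that shares the same long prefix.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Stand-in for a real, stable prefix of at least 1024 tokens (system prompt, tools, etc.).
    system_prompt = "You are a meticulous support assistant. " * 200

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Summarize our return policy."},
        ],
    )

    # Tokens served from the prefix cache are reported here; repeat the call with
    # the same system prompt and this should be > 0.
    print(resp.usage.prompt_tokens_details.cached_tokens)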

Anthropic

Anthropic implements prefix caching with explicit controls. It is unclear where these cache tiers are implemented, but there are two cache retention durations:3

  • 5-minute retention period, likely implemented in GPU HBM
  • 1-hour retention period, likely implemented in node-local NVMe

Both retention durations are referred to as the ephemeral cache type, though, so perhaps the 1-hour cache is still held in volatile memory (HBM or CPU DRAM?).

Anthropic lets you explicitly control breakpoints in your prompt for caching.
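
For example (a sketch using the Anthropic Python SDK; the model name and content are placeholders, and the 1-hour TTL is configured separately from what is shown here): attaching cache_control to a content block marks everything up to and including that block as the cacheable prefix.

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    long_context = "..."  # placeholder: a large document or codebase excerpt above the minimum prefix length

    resp = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        system=[
            {"type": "text", "text": "You are a code-review assistant."},
            {
                "type": "text",
                "text": long_context,
                # Breakpoint: everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": "What does this module do?"}],
    )

    # Usage reports cache writes and cache hits separately.
    print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)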

Cost: Anthropic has different pricing based on whether you opt-in to using prefix caching or not:

  • Default is no caching of prefixes, which has a cost of 1.0x
  • With caching enabled,
    • Writing tokens to the 5-minute cache bills at 1.25x the price of input tokens
    • Writing tokens to the 60-minute cache bills at 2.00x the price of input tokens
    • Reading tokens from either cache bills at 0.10x the price of input tokens
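
Under these multipliers, a quick back-of-the-envelope check (with made-up token counts) shows the write premium is recovered after a single reuse, since every hit replaces a 1.0x read of the prefix with a 0.1x read:

    # Relative input-token cost of a shared prefix, with and without caching
    # (arbitrary example numbers; 1.0 = the base price of an uncached input token).
    prefix_tokens = 50_000     # e.g., a large codebase or document prefix
    reuses = 10                # requests that reuse the same prefix within the TTL

    uncached  = (1 + reuses) * prefix_tokens * 1.00
    cached_5m = prefix_tokens * 1.25 + reuses * prefix_tokens * 0.10
    cached_1h = prefix_tokens * 2.00 + reuses * prefix_tokens * 0.10

    print(uncached, cached_5m, cached_1h)   # 550000.0 112500.0 150000.0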

Minimum prefix: Caching requires minimum prefix lengths ranging from 1024 to 4096 tokens, presumably reflecting a trade-off between the cost of prefill and the HBM the cache consumes. Newer Mythos and Opus models require at least 4096 tokens, and newer Sonnets require 2048. Most models require 1024, and Haiku requires 4096. Haiku is probably so cheap to prefill that caching doesn’t make sense for anything but long contexts.

Privacy: Anthropic used to share cached prefixes across users within an organization (tenant), but in 2026 it reduced the scope to sharing only within a workspace.3

Sizing

The size of the cached key (or value) vectors for a single token is the product of the following factors (a worked example follows the list):1

  • Number of layers
  • Number of attention heads per layer
  • Dimension of the attention head
  • Precision of the key/value
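
A worked example with made-up, Llama-style numbers (the factor of two accounts for both the key and the value vector per head, and the number of KV heads can be smaller than the number of query heads under grouped-query attention):

    # Per-token KV cache size = 2 (K and V) x layers x KV heads x head dim x bytes per element.
    num_layers    = 32      # illustrative
    num_kv_heads  = 8       # KV heads per layer (GQA); equals the query-head count without GQA
    head_dim      = 128
    bytes_per_el  = 2       # fp16 / bf16 keys and values

    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
    print(bytes_per_token)                        # 131072 bytes = 128 KiB per token
    print(bytes_per_token * 128_000 / 2**30)      # ~15.6 GiB of KV cache for a 128k-token context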

Optimizations

Space efficiency

Distributed KV caches can become “perforated,” where parts of a transformer’s cached vectors are missing due to either eviction or failure of one of the cache nodes. vLLM is adding support for recomputing KV vectors for perforated parts of a cache4 without throwing out the entire cache.

ChunkKV5 is a technique that demonstrates how low-scoring tokens in the context can simply be dropped from the KV cache in contiguous chunks, reducing both the memory the cache consumes and the FLOPs spent attending over those tokens when generating subsequent output. You essentially “delete” pointless tokens from the prefix to reduce both memory and compute requirements.
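
A toy sketch of the idea (NumPy; this is my own illustration, not the paper’s actual method): score each fixed-size chunk of the prefix, keep the K/V entries of the top-scoring chunks, and drop the rest.

    import numpy as np

    def compress_kv(K, V, token_scores, chunk_size=64, keep_ratio=0.5):
        """Chunk-level KV eviction: keep the highest-scoring contiguous chunks of
        the prefix (scored here by a per-token importance signal such as
        accumulated attention weight) and drop the rest.
        K, V: (seq_len, d); token_scores: (seq_len,)."""
        seq_len = K.shape[0]
        n_chunks = (seq_len + chunk_size - 1) // chunk_size
        chunk_scores = [token_scores[i*chunk_size:(i+1)*chunk_size].mean() for i in range(n_chunks)]
        n_keep = max(1, int(n_chunks * keep_ratio))
        kept_chunks = sorted(np.argsort(chunk_scores)[-n_keep:])   # preserve original order
        keep_idx = np.concatenate([np.arange(i*chunk_size, min((i+1)*chunk_size, seq_len))
                                   for i in kept_chunks])
        return K[keep_idx], V[keep_idx]

    rng = np.random.default_rng(0)
    K, V = rng.standard_normal((1024, 64)), rng.standard_normal((1024, 64))
    token_scores = rng.random(1024)
    K_small, V_small = compress_kv(K, V, token_scores)   # half the chunks -> half the cache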

Re-use

See CacheBlend, which is a seminal work that demonstrates how cached keys and values can be stitched into prompts.

In addition,

  • DroidSpeak6 (Microsoft Research) demonstrates how a KV cache can be reused across fine-tuned variants of the same base model.
  • SmartCache7 (Chinese University of Hong Kong/Shenzhen University/UT Arlington) and SemShareKV8 (Notre Dame) propose ways to share KV caches across prompts (or chunks of prompts?) that are semantically similar but not token-wise identical.

These techniques effectively turn inference into a semantic search problem; they rewrite prompts in a way that allows them to match prompts whose keys and values have already been cached. The more aggressively these prompts are rewritten to enable cache hits, the lower the quality of the inference output.
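
A minimal sketch of that trade-off (using a toy embedding similarity as a stand-in for whatever matching these systems actually do; the names and threshold are made up): the looser the match threshold, the more cache hits, and the more often a query is served with K/V that were computed for a somewhat different prompt.

    import numpy as np

    def embed(text, dim=256):
        """Toy stand-in for a real embedding model: hash character trigrams into a vector."""
        v = np.zeros(dim)
        for i in range(len(text) - 2):
            v[hash(text[i:i + 3]) % dim] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    semantic_cache = {}   # prompt text -> (embedding, handle to its cached K/V blocks)

    def lookup(prompt, threshold=0.8):
        """Return the cached KV handle of the most similar prompt above `threshold`.
        Lowering the threshold raises the hit rate but reuses K/V computed for
        increasingly different prompts, which degrades output quality."""
        q = embed(prompt)
        best, best_sim = None, threshold
        for emb, kv_handle in semantic_cache.values():
            sim = float(q @ emb)
            if sim >= best_sim:
                best, best_sim = kv_handle, sim
        return best

    semantic_cache["Summarize the attached quarterly report."] = (
        embed("Summarize the attached quarterly report."), "kv-block-0")
    # A reworded query may or may not clear the threshold; that is exactly the quality/hit-rate knob.
    print(lookup("Please summarize the attached quarterly report"))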

Footnotes

  1. https://www.vastdata.com/blog/nvidia-dynamo-vast-scalable-optimized-inference

  2. Prompt caching | OpenAI API

  3. Prompt caching - Claude API Docs

  4. https://github.com/vllm-project/vllm/issues/25950

  5. [2502.00299] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

  6. DroidSpeak: Efficient Context Sharing for Multiple-LLM Inference - Microsoft Research

  7. SmartCache: Context-aware Semantic Cache for Efficient Multi-turn LLM Inference | OpenReview

  8. [2509.24832] SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching