Prefix caching is an optimization in LLM inference where the key and value vectors for a common prefix are cached and shared across queries. For example, consider a few different prompts:
- How do I cook a Chinese eggplant?
- How are babies made?
- How does Waymo work?
The tokens for `How` are common across all of these queries, and `How do` might be common across two of them. Prefix caching allows those common prefixes’ key and value vectors to be computed once, then simply loaded from cache when those prefixes are encountered again.
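To make that concrete, here is a minimal sketch of the bookkeeping involved. The `PrefixCache` class and the `compute_kv` placeholder are made up for illustration; a real engine caches fixed-size blocks of GPU tensors, not Python tuples.

```python
from dataclasses import dataclass, field

# Stand-in for the real per-token key/value tensors; in a real engine these
# would be GPU tensors shaped (num_layers, num_heads, head_dim).
KVPair = tuple[str, str]


def compute_kv(token: str) -> KVPair:
    """Placeholder for the expensive K/V projection of a single token."""
    return (f"key({token})", f"value({token})")


@dataclass
class PrefixCache:
    """Toy cache mapping a tuple of prefix tokens to that prefix's K/V list."""
    _cache: dict[tuple[str, ...], list[KVPair]] = field(default_factory=dict)

    def lookup_longest_prefix(self, tokens: list[str]) -> tuple[int, list[KVPair]]:
        """Return (number of cached tokens, their K/V pairs) for the longest hit."""
        for end in range(len(tokens), 0, -1):
            hit = self._cache.get(tuple(tokens[:end]))
            if hit is not None:
                return end, hit
        return 0, []

    def prefill(self, tokens: list[str]) -> list[KVPair]:
        """Compute K/V for `tokens`, reusing and extending any cached prefix."""
        cached_len, kv = self.lookup_longest_prefix(tokens)
        kv = list(kv)  # copy so we don't mutate the cached entry
        for i in range(cached_len, len(tokens)):
            kv.append(compute_kv(tokens[i]))                # only the uncached suffix is computed
            self._cache[tuple(tokens[: i + 1])] = list(kv)  # remember every new prefix
        return kv
```

Production engines typically hash fixed-size blocks of tokens rather than storing every possible prefix, which keeps lookup and memory costs bounded, but the idea is the same: match the longest cached prefix, then compute keys and values only for the suffix.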
There are a couple cases where prefix caching becomes hugely beneficial:
- Long system prompts. Every prompt issued to a chatbot will have a bunch of hidden system prompts that are repeated at the beginning of every single query. For example, Claude Opus 4.1 has a system prompt[^1] that is 3,377 tokens long.[^2]
- Multi-turn conversations. When you ask a question and get an answer, the next question you ask is really a new prompt that contains the previous question and answer in its context window. Caching the keys and values for that entire previous exchange avoids the latency of prefilling the whole new prompt from scratch (see the sketch below).
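Continuing the toy `PrefixCache` sketch from above (still illustrative, not a real engine API), the multi-turn case falls out naturally: the second turn's prompt literally begins with the first turn's prompt and answer, so only the newly appended tokens need fresh key/value computation.

```python
cache = PrefixCache()

# Turn 1: hidden system prompt + first user question + the model's answer.
# (Whitespace "tokenization" is just for the demo.)
turn1 = ("SYSTEM: You are a helpful assistant. "
         "USER: How do I cook a Chinese eggplant? "
         "ASSISTANT: Slice it thinly and steam it.").split()
cache.prefill(turn1)  # everything is computed fresh and cached as it goes

# Turn 2: the entire previous conversation is the prefix of the new prompt.
turn2 = turn1 + "USER: Can I use an air fryer instead?".split()
cached_len, _ = cache.lookup_longest_prefix(turn2)
cache.prefill(turn2)  # only the new USER tokens get fresh K/V computation

print(f"turn 2 has {len(turn2)} tokens; {cached_len} were served from the cache")
```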