Inferencing differs from training in that models are often quantized to reduced precision to lower the memory and compute required to process requests.
From the DeepSpeed-FastGen paper, inferencing has two distinct phases (a minimal sketch follows the list):
- prefill, or prompt processing:
  - input is the user-provided text (the prompt)
  - output is a key-value (KV) cache for attention
  - compute-bound and scales with the input length
- decode, or token generation:
  - adds a token to the KV cache, then generates a new token
  - memory bandwidth-bound and shows approximately O(1) scaling per step
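A minimal sketch of the two phases, assuming a toy stand-in for attention (no real model; the embedding, scoring, and "sampling" below are illustrative only):

```python
import numpy as np

D = 64  # hidden size (illustrative)

def embed(token: int) -> np.ndarray:
    """Stand-in for embedding + transformer layers: deterministic per token."""
    return np.random.default_rng(token).random(D)

def prefill(prompt_tokens: list[int]) -> list[tuple[np.ndarray, np.ndarray]]:
    """Process the whole prompt at once: compute-bound, scales with prompt length.
    Output is the KV cache -- one (key, value) pair per prompt token."""
    kv_cache = []
    for tok in prompt_tokens:
        h = embed(tok)
        kv_cache.append((h, h))  # stand-in for projected keys and values
    return kv_cache

def decode_step(kv_cache, last_token: int) -> int:
    """Generate one token: reads the whole KV cache (memory bandwidth-bound),
    but does roughly O(1) new compute per step."""
    h = embed(last_token)
    scores = [float(h @ k) for k, _ in kv_cache]  # attention over the cache
    kv_cache.append((h, h))                       # cache grows by one entry
    return int(np.argmax(scores))                 # stand-in for sampling

# Usage: prefill once, then decode token by token.
cache = prefill([101, 2023, 2003, 1037, 3231])
token = 0
for _ in range(4):
    token = decode_step(cache, token)
```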
Efficient Memory Management for Large Language Model Serving with PagedAttention by Kwon et al. describes how GPU memory is consumed during inferencing.
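The core bookkeeping idea there is to split the KV cache into fixed-size blocks and map each sequence's logical blocks to physical blocks on demand, which limits fragmentation. A minimal sketch of that idea (the block size, class names, and allocator are illustrative, not vLLM's internals):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockAllocator:
    """Pool of physical KV-cache blocks shared by all sequences."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache exhausted; a real system would preempt or swap")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Tracks one request's logical-to-physical block mapping (its block table)."""
    def __init__(self):
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical block index -> physical block id

    def append_token(self, allocator: BlockAllocator) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:   # current block full (or first token)
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

# Usage: two sequences share one physical pool; blocks are grabbed only as needed.
allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(), Sequence()
for _ in range(20):
    a.append_token(allocator)   # 20 tokens -> 2 blocks
for _ in range(5):
    b.append_token(allocator)   # 5 tokens -> 1 block
print(a.block_table, b.block_table, "free:", allocator.free)
```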
Disaggregated inferencing
As of 2024, disaggregated inferencing has become a hot topic. This is a technique where prefill and decode occur on different GPUs to exploit the different bottlenecks of each (compute and memory bandwidth, respectively).
Disaggregated inferencing was first described by Microsoft in 2023[^1] and featured prominently in Jensen Huang’s GTC 2025 keynote. It is the basis for NVIDIA Dynamo.
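A hypothetical sketch of the flow: a router sends the prompt to a prefill pool, ships the resulting KV cache to a decode pool, and a decode worker streams tokens. All names below are invented for illustration; real systems (Splitwise, Dynamo) transfer the KV cache over fast interconnects and involve far more scheduling detail.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: list[int]          # stand-in for per-token key/value tensors

class PrefillWorker:           # typically placed on compute-optimized GPUs
    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        return KVCache(tokens=list(prompt_tokens))

class DecodeWorker:            # typically bound by memory bandwidth
    def decode(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        out = []
        for _ in range(max_new_tokens):
            new_token = len(kv.tokens) % 50_000  # stand-in for sampling
            kv.tokens.append(new_token)          # KV cache grows one entry per step
            out.append(new_token)
        return out

def route_request(prompt_tokens, prefill_pool, decode_pool, max_new_tokens=8):
    kv = prefill_pool[0].prefill(prompt_tokens)       # phase 1: compute-bound
    # In a real deployment the KV cache is transferred between GPUs/nodes here.
    return decode_pool[0].decode(kv, max_new_tokens)  # phase 2: bandwidth-bound

print(route_request([1, 2, 3], [PrefillWorker()], [DecodeWorker()]))
```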
In practice
Open-source
vLLM and TensorRT-LLM are open-source frameworks for building inferencing services (a minimal vLLM example follows the list below).
- llm-d is a “distributed inference serving stack” which incorporates vLLM.
- vLLM Production Stack is a distributed “inference stack on top of vLLM.”
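A minimal vLLM example, following its offline-inference quickstart; the exact API can shift between versions, and the model name here is just an example:

```python
from vllm import LLM, SamplingParams

# Small model for illustration; any Hugging Face causal LM that vLLM supports works.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["What is disaggregated inferencing?"], params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```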
Azure
AI Foundry has an inferencing-as-a-service feature. Not sure how this works as of 2025.
ChatGPT
ChatGPT stores conversations, prompts, and metadata in Azure Cosmos DB.[^2]
ChatGPT is built on Azure Kubernetes Service.[^2]
Footnotes
[^1]: Splitwise: Efficient generative LLM inference using phase splitting
[^2]: Scott Guthrie’s keynote at Microsoft Build 2025 - Unpacking the tech