Here is a list of open-source inferencing frameworks. I mostly care about them because they are the interfaces through which VAST customers interact with KV cache.
- vLLM came out of Ion Stoica’s lab and is now part of the PyTorch foundation.
- SGLang has the same founding DNA as vLLM and is a collaboration between Berkeley, Stanford, UCSD, CMU, and MBZUAI. It's best thought of as a sibling and competitor to vLLM rather than a successor.
- TensorRT-LLM is NVIDIA’s inferencing runtime, built on top of the TensorRT SDK.
In terms of KV caching, all three manage cached keys/values in fixed-size blocks and can reuse or spill them across memory tiers:
- vLLM implements PagedAttention, which stores the KV cache in non-contiguous fixed-size blocks (with optional offload of blocks to CPU memory)
- SGLang implements RadixAttention, which organizes the KV cache in a radix tree so that requests sharing a prefix reuse the same cached blocks
- TensorRT-LLM implements a paged KV cache similar to vLLM's PagedAttention
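The core idea shared by all three is paging: instead of preallocating one contiguous KV buffer per sequence, the cache is carved into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks. Here is a minimal, illustrative sketch of that indirection (block size, pool size, and class names are all made up for the example, not taken from any of these frameworks):

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per KV block (illustrative)
NUM_BLOCKS = 8    # size of the physical block pool
HEAD_DIM = 4      # toy head dimension

class PagedKVCache:
    """Toy paged KV cache: a per-sequence block table maps logical
    token positions to physical blocks, so a sequence can grow without
    a contiguous preallocation and blocks can live anywhere in the pool."""

    def __init__(self):
        # physical pool: one (K, V) pair of BLOCK_SIZE slots per block
        self.pool = np.zeros((NUM_BLOCKS, 2, BLOCK_SIZE, HEAD_DIM))
        self.free = list(range(NUM_BLOCKS))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append(self, seq_id, pos, k, v):
        table = self.block_tables.setdefault(seq_id, [])
        block_idx, offset = divmod(pos, BLOCK_SIZE)
        if block_idx == len(table):          # sequence grew past its last block
            table.append(self.free.pop())    # grab any free physical block
        phys = table[block_idx]
        self.pool[phys, 0, offset] = k       # write key slot
        self.pool[phys, 1, offset] = v       # write value slot

cache = PagedKVCache()
for pos in range(20):                        # 20 tokens span 2 blocks of 16
    cache.append("seq0", pos, np.ones(HEAD_DIM), np.ones(HEAD_DIM))
print(cache.block_tables["seq0"])            # two physical block ids
```

Because the mapping is explicit, a block can just as easily point at a copy in CPU memory or on disk, which is exactly the seam the offload frameworks below exploit.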
These frameworks typically offload KV cache only as far as CPU memory, though. To move cached keys/values to storage, you need a different framework:
- LMCache was built to work with vLLM and manage KV cache offloads to storage
- Dynamo's KV Block Manager (KVBM) plays the same role, moving KV blocks between GPU memory, host memory, and storage
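What these offload layers add is tiering: a hot tier of limited capacity that demotes cold blocks to a slower, larger tier and promotes them back on access. A toy two-tier version (the class, capacities, and file layout are my own illustration, not LMCache's or KVBM's actual design) looks like this:

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV-cache store: a small hot tier (standing in for
    GPU/CPU memory) evicts least-recently-used blocks to files on disk
    (standing in for the storage tier) and reloads them on demand."""

    def __init__(self, hot_capacity=2, root=None):
        self.hot = OrderedDict()             # hot tier, kept in LRU order
        self.hot_capacity = hot_capacity
        self.root = root or tempfile.mkdtemp()

    def put(self, block_id, kv):
        self.hot[block_id] = kv
        self.hot.move_to_end(block_id)       # mark most recently used
        while len(self.hot) > self.hot_capacity:
            victim, data = self.hot.popitem(last=False)  # evict LRU block
            with open(os.path.join(self.root, victim), "wb") as f:
                pickle.dump(data, f)         # demote to the storage tier

    def get(self, block_id):
        if block_id in self.hot:             # hot-tier hit
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = os.path.join(self.root, block_id)
        with open(path, "rb") as f:          # miss: promote from storage
            data = pickle.load(f)
        self.put(block_id, data)
        return data

store = TieredKVStore(hot_capacity=2)
store.put("b0", [0.0])
store.put("b1", [1.0])
store.put("b2", [2.0])                       # capacity 2, so "b0" is demoted
print(store.get("b0"))                       # transparently reloaded from disk
```

The real systems obviously do this with zero-copy transfers and storage APIs rather than pickled files, but the promote/demote state machine is the part that matters from a storage vendor's perspective.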
Higher-level stacks are built on top of some of the above frameworks as well:
- Dynamo is NVIDIA’s distributed inferencing runtime. It supports vLLM, SGLang, and TensorRT-LLM. It implements a KV Block Manager which can offload KV cache from memory to storage.
- llm-d is a “distributed inference serving stack” which incorporates vLLM.
- vLLM Production Stack is a distributed “inference stack on top of vLLM.”