Here is a list of open-source inferencing frameworks. I mostly care about them because these are the interface through which VAST customers interact with KV cache.

  • vLLM came out of Ion Stoica’s lab and is now part of the PyTorch foundation.
  • SGLang shares founding DNA with vLLM and is a collaboration between Berkeley, Stanford, UCSD, CMU, and MBZUAI. It is sometimes described as a successor to vLLM, though both projects remain under active development.
  • TensorRT-LLM is NVIDIA’s inferencing runtime, built on top of the TensorRT SDK.

In terms of KV caching, all three manage cached keys/values in GPU memory and can spill them to other tiers:

  • vLLM implements PagedAttention, which pages the KV cache into fixed-size blocks
  • SGLang implements RadixAttention, which reuses cached KV entries across requests that share a prefix
  • TensorRT-LLM implements a paged KV cache similar to PagedAttention
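To make the paging idea concrete, here is a minimal Python sketch of the bookkeeping behind a paged KV cache: a block table maps each sequence's logical token positions to fixed-size physical blocks drawn from a shared pool. This is an illustration of the PagedAttention idea, not vLLM's actual implementation; all class and method names are invented.

```python
# Illustrative sketch of paged KV-cache bookkeeping (not vLLM's real code).
# Each sequence maps its tokens onto fixed-size blocks from a shared pool,
# so KV memory grows in pages instead of one contiguous slab per sequence.

BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    """Shared pool of physical KV blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class SequenceKVCache:
    """Block table for one sequence: logical token index -> physical block."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a fresh block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int) -> tuple[int, int]:
        """Return (block_id, offset) where this token's KV entries live."""
        return (self.block_table[token_idx // BLOCK_SIZE],
                token_idx % BLOCK_SIZE)

allocator = BlockAllocator(num_blocks=64)
seq = SequenceKVCache(allocator)
for _ in range(40):           # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))   # 3 blocks cover 40 tokens at 16 tokens/block
```

The point of the indirection is that blocks need not be contiguous, so memory is allocated on demand and freed blocks can be handed to any other sequence.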

Out of the box, though, these frameworks typically offload KV cache only as far as CPU memory. To move cached keys/values to storage, you need a higher-level stack built on top of them:

  • Dynamo is NVIDIA’s distributed inferencing runtime. It supports vLLM, SGLang, and TensorRT-LLM [1]. It implements a KV Block Manager which can offload KV cache from memory to storage.
  • llm-d is a “distributed inference serving stack” which incorporates vLLM.
  • vLLM Production Stack is a distributed “inference stack on top of vLLM.”
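A minimal sketch of the tiered-offload idea these stacks implement: evict KV blocks from a bounded hot tier (GPU memory) to a colder tier (CPU memory or storage), and reload a block into the hot tier on a cache hit. The class, the LRU eviction policy, and the two-tier layout are all invented for illustration; real systems like Dynamo's KV Block Manager have their own policies and APIs.

```python
# Illustrative two-tier KV-cache offload (invented names, not a real API).
# Hot blocks live in a bounded "GPU" tier; evicted blocks spill to a "CPU"
# tier instead of being dropped; a lookup promotes the block back to hot.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu: "OrderedDict[str, bytes]" = OrderedDict()  # hot tier (LRU)
        self.cpu: dict = {}                                  # offload tier
        self.gpu_capacity = gpu_capacity

    def put(self, block_id: str, kv_bytes: bytes) -> None:
        self.gpu[block_id] = kv_bytes
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # evict LRU block
            self.cpu[victim] = data                      # offload, don't drop

    def get(self, block_id: str):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)       # refresh LRU position
            return self.gpu[block_id]
        if block_id in self.cpu:                 # hit in the offload tier:
            data = self.cpu.pop(block_id)        # reload it into the hot tier
            self.put(block_id, data)
            return data
        return None                              # miss: must recompute KV

cache = TieredKVCache(gpu_capacity=2)
cache.put("a", b"kv-a")
cache.put("b", b"kv-b")
cache.put("c", b"kv-c")      # evicts "a" to the CPU tier
print("a" in cache.cpu)      # True
print(cache.get("a"))        # b'kv-a', promoted back to the hot tier
```

The storage-backed systems add a third tier below this one (replace the `cpu` dict with reads/writes against a filesystem or object store), which is exactly the layer where the frameworks above hand KV blocks to external storage.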

Footnotes

  1. Dynamo Inference Framework | NVIDIA Developer