Here is a list of open-source inferencing frameworks. I mostly care about them because they are the interfaces through which VAST customers interact with KV cache.
- vLLM came out of Ion Stoica’s lab and is now part of the PyTorch foundation.
- SGLang has the same founding DNA as vLLM and is a collaboration between Berkeley, Stanford, UCSD, CMU, and MBZUAI. It's best thought of as a sibling and competitor to vLLM rather than a successor.
- TensorRT-LLM is NVIDIA’s inferencing runtime, built on top of the TensorRT SDK.
In terms of KV caching, all three manage cached keys/values in fixed-size blocks and can reuse or spill them across memory tiers:
- vLLM implements PagedAttention, which stores the KV cache in non-contiguous fixed-size blocks (with optional offload of blocks to CPU memory)
- SGLang implements RadixAttention, which organizes the KV cache in a radix tree so that requests sharing a prefix reuse the same cached blocks
- TensorRT-LLM implements a paged KV cache similar to vLLM's PagedAttention
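The core idea shared by all three is paging: instead of preallocating one contiguous KV buffer per sequence, the cache is carved into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks. Here is a minimal, illustrative sketch of that indirection (block size, pool size, and class names are all made up for the example, not taken from any of these frameworks):

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per KV block (illustrative)
NUM_BLOCKS = 8    # size of the physical block pool
HEAD_DIM = 4      # toy head dimension

class PagedKVCache:
    """Toy paged KV cache: a per-sequence block table maps logical
    token positions to physical blocks, so a sequence can grow without
    a contiguous preallocation and blocks can live anywhere in the pool."""

    def __init__(self):
        # physical pool: one (K, V) pair of BLOCK_SIZE slots per block
        self.pool = np.zeros((NUM_BLOCKS, 2, BLOCK_SIZE, HEAD_DIM))
        self.free = list(range(NUM_BLOCKS))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append(self, seq_id, pos, k, v):
        table = self.block_tables.setdefault(seq_id, [])
        block_idx, offset = divmod(pos, BLOCK_SIZE)
        if block_idx == len(table):          # sequence grew past its last block
            table.append(self.free.pop())    # grab any free physical block
        phys = table[block_idx]
        self.pool[phys, 0, offset] = k       # write key slot
        self.pool[phys, 1, offset] = v       # write value slot

cache = PagedKVCache()
for pos in range(20):                        # 20 tokens span 2 blocks of 16
    cache.append("seq0", pos, np.ones(HEAD_DIM), np.ones(HEAD_DIM))
print(cache.block_tables["seq0"])            # two physical block ids
```

Because the mapping is explicit, a block can just as easily point at a copy in CPU memory or on disk, which is exactly the seam the offload frameworks below exploit.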
These frameworks typically offload KV cache only as far as CPU memory, though. To move cached keys/values to storage, you need a different framework:
- LMCache was built to work with vLLM and manage KV cache offloads to storage
- Dynamo's KV Block Manager (KVBM) plays the same role, moving KV blocks between GPU memory, host memory, and storage
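What these offload layers add is tiering: a hot tier of limited capacity that demotes cold blocks to a slower, larger tier and promotes them back on access. A toy two-tier version (the class, capacities, and file layout are my own illustration, not LMCache's or KVBM's actual design) looks like this:

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV-cache store: a small hot tier (standing in for
    GPU/CPU memory) evicts least-recently-used blocks to files on disk
    (standing in for the storage tier) and reloads them on demand."""

    def __init__(self, hot_capacity=2, root=None):
        self.hot = OrderedDict()             # hot tier, kept in LRU order
        self.hot_capacity = hot_capacity
        self.root = root or tempfile.mkdtemp()

    def put(self, block_id, kv):
        self.hot[block_id] = kv
        self.hot.move_to_end(block_id)       # mark most recently used
        while len(self.hot) > self.hot_capacity:
            victim, data = self.hot.popitem(last=False)  # evict LRU block
            with open(os.path.join(self.root, victim), "wb") as f:
                pickle.dump(data, f)         # demote to the storage tier

    def get(self, block_id):
        if block_id in self.hot:             # hot-tier hit
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = os.path.join(self.root, block_id)
        with open(path, "rb") as f:          # miss: promote from storage
            data = pickle.load(f)
        self.put(block_id, data)
        return data

store = TieredKVStore(hot_capacity=2)
store.put("b0", [0.0])
store.put("b1", [1.0])
store.put("b2", [2.0])                       # capacity 2, so "b0" is demoted
print(store.get("b0"))                       # transparently reloaded from disk
```

The real systems obviously do this with zero-copy transfers and storage APIs rather than pickled files, but the promote/demote state machine is the part that matters from a storage vendor's perspective.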
Higher-level stacks are built on top of some of the above frameworks as well:
- Dynamo is NVIDIA’s distributed inferencing runtime. It supports vLLM, SGLang, and TensorRT-LLM. It implements a KV Block Manager which can offload KV cache from memory to storage.
- llm-d is a “distributed inference serving stack” which incorporates vLLM.
- vLLM Production Stack is a distributed “inference stack on top of vLLM.”