There are two ways to serve many simultaneous long-context LLM requests, compared on hourly cost and time-to-first-token (TTFT):

  • Context parallelism (CP): Buy enough GPUs that all KV caches fit in GPU HBM simultaneously. More GPUs also parallelize prefill, reducing TTFT.
  • KV cache offload: Use the minimum GPUs needed to run the model. Spill KV caches to storage instead of HBM. Reload per request at serving time.
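
As a rough sketch of the comparison, using this page's default parameters (8×80 GB GPUs per server at $3.50/GPU/hr, 42 GB of KV per 128K-token request) plus one assumption of mine, ~140 GB for 70B bf16 weights:

```python
import math

# Page defaults; WEIGHTS_GB is an assumption (70B params x 2 bytes in bf16).
GPU_HR = 3.50            # $/GPU/hr
GPUS_PER_SERVER = 8
HBM_GB = 80              # per GPU
WEIGHTS_GB = 140         # assumed: 70B dense model, bf16
KV_GB = 42               # KV cache per 128K-token request

def servers_cp(concurrency):
    # CP sizing: enough aggregate HBM for the weights plus every
    # concurrent request's KV cache.
    need_gb = WEIGHTS_GB + concurrency * KV_GB
    per_server_gb = GPUS_PER_SERVER * HBM_GB
    return math.ceil(need_gb / per_server_gb)

def hourly_cost(servers):
    return servers * GPUS_PER_SERVER * GPU_HR

print(hourly_cost(servers_cp(50)))   # CP at 50 concurrent sessions: $112/hr
print(hourly_cost(1))                # offload: min-GPU footprint, NVMe free: $28/hr
```

This is a sketch of the sizing logic, not a pricing calculator; it ignores activation memory and any external-storage cost once local NVMe overflows.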

Interactive calculator defaults:

  Model & workload: 128K context length, 50 concurrent requests
  GPU server: 80 GB HBM per GPU, GPU cost $3.50/hr, 16 TB local NVMe per server, 56 GB/s local NVMe bandwidth
  External storage (overflow only): 10 GB/s bandwidth, $0.20/GB/mo

[Live readouts: overflow threshold, KV cache per request, total KV (all concurrent), servers (offload / CP), offload cost vs CP]

[Plot: TTFT comparison. Series: CP prefill, KV offload (local NVMe), min-GPU prefill]

[Plot: Hourly cost breakdown. Series: GPU, local NVMe (free), external storage]

[Plot: Hourly cost vs concurrency, showing the local NVMe overflow threshold. Series: CP approach, KV offload (GPU cost only, local NVMe free), KV offload (once external storage needed)]
The TTFT comparison plot shows the performance behavior:

  • CP (sufficient HBM) is the performance you get from deploying additional GPUs (8 per server) for more HBM. Each GPU you add for HBM also brings FLOPS, so prefill time decreases.
  • KV offload is the performance you get if you skip prefill entirely and instead read the saved cache from storage (local NVMe or external, depending on whether the total KV across all concurrent requests still fits on local NVMe).
  • Min-GPU prefill is the performance you get from deploying the minimum number of GPUs required to fit all model weights and a single session’s KV cache.

There are interesting knock-on effects of increasing concurrency or context: under CP, prefill actually gets faster, because more GPUs are being deployed to hold the additional concurrent sessions’ KV caches in HBM. But GPUs are far more expensive than storage, so in this model it is always cheaper to retrieve from the offloaded cache.
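
The formulas behind the TTFT plot can be sketched as follows. The 70B parameter count, 100 effective TFLOPS, and bandwidth figures are the page defaults; prefill FLOPs ≈ 2 × params × tokens is a standard approximation for a dense transformer forward pass:

```python
# Illustrative defaults from this page, not benchmarks.
PARAMS = 70e9            # dense 70B model
EFF_TFLOPS = 100         # effective (post-MFU) TFLOPS per GPU
KV_GB = 42               # KV cache per 128K-token request
NVME_GBPS = 56           # aggregate local NVMe read bandwidth
EXT_GBPS = 10            # external storage read bandwidth

def ttft_prefill(context_tokens, num_gpus):
    # Prefill compute ~ 2 * params * tokens FLOPs, split linearly
    # across GPUs (ignores CP communication overhead).
    flops = 2 * PARAMS * context_tokens
    return flops / (num_gpus * EFF_TFLOPS * 1e12)

def ttft_offload(kv_gb, bandwidth_gbps):
    # No prefill: just stream the saved KV cache back into HBM.
    return kv_gb / bandwidth_gbps

print(f"CP, 8 GPUs:        {ttft_prefill(128_000, 8):.1f} s")   # 22.4 s
print(f"Offload, NVMe:     {ttft_offload(KV_GB, NVME_GBPS):.2f} s")  # 0.75 s
print(f"Offload, external: {ttft_offload(KV_GB, EXT_GBPS):.1f} s")   # 4.2 s
```

Even at 8 GPUs, reloading 42 GB from local NVMe comfortably beats recomputing the prefill, which is the core of the offload argument.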

How KV cache size is computed

Per token, the KV cache stores a key and a value vector for every layer, so its size is 2 × layers × kv_heads × head_dim × bytes-per-element × context tokens. For a 70B-class model (80 layers, 8 KV heads with GQA, head dim 128, bf16) that is ~320 KB per token, and at 128K tokens it works out to ~42 GB per active conversation. The equation changes a little for mixture of experts models (expert sparsity affects the FFN, not attention or the KV cache), but not fundamentally.
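
The arithmetic can be checked directly; the 80-layer / 8-KV-head / 128-head-dim shape is an assumed Llama-70B-like configuration:

```python
# KV cache size for a dense transformer with grouped-query attention.
# Shape assumptions: 80 layers, 8 KV heads, head dim 128, bf16 (2 bytes).
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Factor of 2: one K and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(context_tokens, **shape):
    return kv_bytes_per_token(**shape) * context_tokens / 1e9

print(kv_bytes_per_token())          # 327,680 bytes (~320 KB) per token
print(round(kv_cache_gb(128_000)))   # ~42 GB per 128K-token request
```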

Key assumptions

Here are the gotchas in the model:

  • TTFT for CP is modeled as prefill FLOPs ÷ aggregate throughput, i.e. roughly (2 × params × context tokens) / (GPU count × effective TFLOPS). This assumes linear scaling with GPU count and ignores ring attention communication overhead, which grows with CP degree. Real TTFT will be higher than shown at large CP values.
  • TTFT for offload is modeled as KV cache size ÷ storage read bandwidth. This assumes the storage system can sustain its rated bandwidth to a single request.
  • GPU TFLOPS is effective throughput after MFU, not peak. 100 TFLOPS for an H100 is probably a conservative estimate; real utilization will vary with batch size and sequence length and should be benchmarked.
  • Local NVMe bandwidth is the aggregate of all drives on the server. Whether a single request can actually saturate that depends on how good the KV cache software (SGLang, etc) is.
  • Everything assumes bf16; quantization halves (fp8/int8) or quarters (int4) the weight footprint, shifting the GPU minimum downward.
  • A single model replica is being served. With more replicas, throughput increases, but the effects of local NVMe partitioning also come into play (each replica only sees its own server’s NVMe), increasing the value of external storage.

Parameters

  • Context length: KV cache per request grows linearly; overflow threshold and TTFT both shift
  • Concurrency: Total KV cache grows linearly; CP GPU count steps up in multiples of tp_base (e.g., 8 GPUs per server)
  • HBM per GPU: Higher HBM delays the point where CP needs additional GPUs
  • GPU cost: Scales both approaches proportionally; doesn’t change the ratio unless storage cost is significant
  • GPU TFLOPS: Affects CP TTFT only; offload TTFT is I/O bandwidth-bound
  • Local NVMe per server: Sets the “free” (zero-dollar) storage capacity before external storage is needed for offload
  • Local NVMe bandwidth: Primary driver of offload TTFT
  • External storage bandwidth: Only matters after local NVMe runs out of capacity
  • External storage cost: Only affects cost after local NVMe overflows
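
The overflow threshold itself is a one-line computation under the page defaults (42 GB of KV per request, 16 TB of local NVMe per server):

```python
# Concurrency at which total KV spills past the "free" local NVMe tier.
KV_PER_REQUEST_GB = 42       # per 128K-token request
NVME_TB_PER_SERVER = 16

def overflow_threshold(num_servers=1):
    free_gb = num_servers * NVME_TB_PER_SERVER * 1000
    # Max concurrent sessions whose caches fit on local NVMe.
    return free_gb // KV_PER_REQUEST_GB

print(overflow_threshold())  # ~380 sessions before external storage is needed
```

At the default concurrency of 50 this is nowhere near overflowing, which is why the offload line in the cost plot stays at pure GPU cost for so long.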

Not modeled

This model ignores a bunch of things:

  • Decode latency: tokens per second after the first token are not reflected anywhere. The model covers prefill only; decode must access keys and values in HBM regardless of which approach loaded them there.
  • KV cache offload writes: only the read path is modeled; flushing KV cache from HBM to storage is not on the prefill critical path and is therefore not modeled.
  • KV cache compression/quantization: shrinks the KV cache per request, which would shift the overflow threshold and offload TTFT downward.
  • Prefix caching: Shared prefix caching across requests would increase KV cache hit rate in different tiers. Too complex to model without a specific workload trace.