There are two ways to serve many simultaneous long-context LLM requests, compared on hourly cost and time-to-first-token (TTFT):

  • Context parallelism (CP): Buy enough GPUs that all KV caches fit in GPU HBM simultaneously. More GPUs also parallelize prefill, reducing TTFT.
  • KV cache offload: Use the minimum GPUs needed to run the model. Spill KV caches to storage instead of HBM. Reload per request at serving time.
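
As a rough sketch of the comparison, using this page's default parameters (8×80 GB GPUs per server at $3.50/GPU/hr, 42 GB of KV per 128K-token request) plus one assumption of mine, ~140 GB for 70B bf16 weights:

```python
import math

# Page defaults; WEIGHTS_GB is an assumption (70B params x 2 bytes in bf16).
GPU_HR = 3.50            # $/GPU/hr
GPUS_PER_SERVER = 8
HBM_GB = 80              # per GPU
WEIGHTS_GB = 140         # assumed: 70B dense model, bf16
KV_GB = 42               # KV cache per 128K-token request

def servers_cp(concurrency):
    # CP sizing: enough aggregate HBM for the weights plus every
    # concurrent request's KV cache.
    need_gb = WEIGHTS_GB + concurrency * KV_GB
    per_server_gb = GPUS_PER_SERVER * HBM_GB
    return math.ceil(need_gb / per_server_gb)

def hourly_cost(servers):
    return servers * GPUS_PER_SERVER * GPU_HR

print(hourly_cost(servers_cp(50)))   # CP at 50 concurrent sessions: $112/hr
print(hourly_cost(1))                # offload: min-GPU footprint, NVMe free: $28/hr
```

This is a sketch of the sizing logic, not a pricing calculator; it ignores activation memory and any external-storage cost once local NVMe overflows.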

Interactive calculator defaults:

  Model & workload: 128K context length, 50 concurrent requests
  GPU server: 80 GB HBM per GPU, GPU cost $3.50/hr, 16 TB local NVMe per server, 56 GB/s local NVMe bandwidth
  External storage (overflow only): 10 GB/s bandwidth, $0.20/GB/mo

[Live readouts: overflow threshold, KV cache per request, total KV (all concurrent), servers (offload / CP), offload cost vs CP]

[Plot: TTFT comparison. Series: CP prefill, KV offload (local NVMe), min-GPU prefill]

[Plot: Hourly cost breakdown. Series: GPU, local NVMe (free), external storage]

[Plot: Hourly cost vs concurrency, showing the local NVMe overflow threshold. Series: CP approach, KV offload (GPU cost only, local NVMe free), KV offload (once external storage needed)]
The TTFT comparison plot shows the performance behavior:

  • CP (sufficient HBM) is the performance you get from deploying additional GPUs (8 per server) for more HBM. Each GPU you add for HBM also brings FLOPS, so prefill time decreases.
  • KV offload is the performance you get if you skip prefill entirely and instead read the saved cache from storage (local NVMe or external, depending on whether the total KV across all concurrent requests still fits on local NVMe).
  • Min-GPU prefill is the performance you get from deploying the minimum number of GPUs required to fit all model weights and a single session’s KV cache.

There are interesting knock-on effects of increasing concurrency or context: under CP, prefill actually gets faster, because more GPUs are being deployed to hold the additional concurrent sessions’ KV caches in HBM. But GPUs are far more expensive than storage, so in this model it is always cheaper to retrieve from the offloaded cache.
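
The formulas behind the TTFT plot can be sketched as follows. The 70B parameter count, 100 effective TFLOPS, and bandwidth figures are the page defaults; prefill FLOPs ≈ 2 × params × tokens is a standard approximation for a dense transformer forward pass:

```python
# Illustrative defaults from this page, not benchmarks.
PARAMS = 70e9            # dense 70B model
EFF_TFLOPS = 100         # effective (post-MFU) TFLOPS per GPU
KV_GB = 42               # KV cache per 128K-token request
NVME_GBPS = 56           # aggregate local NVMe read bandwidth
EXT_GBPS = 10            # external storage read bandwidth

def ttft_prefill(context_tokens, num_gpus):
    # Prefill compute ~ 2 * params * tokens FLOPs, split linearly
    # across GPUs (ignores CP communication overhead).
    flops = 2 * PARAMS * context_tokens
    return flops / (num_gpus * EFF_TFLOPS * 1e12)

def ttft_offload(kv_gb, bandwidth_gbps):
    # No prefill: just stream the saved KV cache back into HBM.
    return kv_gb / bandwidth_gbps

print(f"CP, 8 GPUs:        {ttft_prefill(128_000, 8):.1f} s")   # 22.4 s
print(f"Offload, NVMe:     {ttft_offload(KV_GB, NVME_GBPS):.2f} s")  # 0.75 s
print(f"Offload, external: {ttft_offload(KV_GB, EXT_GBPS):.1f} s")   # 4.2 s
```

Even at 8 GPUs, reloading 42 GB from local NVMe comfortably beats recomputing the prefill, which is the core of the offload argument.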

How KV cache size is computed

Per token, the KV cache stores a key and a value vector for every layer, so its size is 2 × layers × kv_heads × head_dim × bytes-per-element × context tokens. For a 70B-class model (80 layers, 8 KV heads with GQA, head dim 128, bf16) that is ~320 KB per token, and at 128K tokens it works out to ~42 GB per active conversation. The equation changes a little for mixture of experts models (expert sparsity affects the FFN, not attention or the KV cache), but not fundamentally.
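
The arithmetic can be checked directly; the 80-layer / 8-KV-head / 128-head-dim shape is an assumed Llama-70B-like configuration:

```python
# KV cache size for a dense transformer with grouped-query attention.
# Shape assumptions: 80 layers, 8 KV heads, head dim 128, bf16 (2 bytes).
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Factor of 2: one K and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(context_tokens, **shape):
    return kv_bytes_per_token(**shape) * context_tokens / 1e9

print(kv_bytes_per_token())          # 327,680 bytes (~320 KB) per token
print(round(kv_cache_gb(128_000)))   # ~42 GB per 128K-token request
```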

Key assumptions

Here are the gotchas in the model:

  • TTFT for CP is modeled as prefill FLOPs ÷ aggregate throughput, i.e. roughly (2 × params × context tokens) / (GPU count × effective TFLOPS). This assumes linear scaling with GPU count and ignores ring attention communication overhead, which grows with CP degree. Real TTFT will be higher than shown at large CP values.
  • TTFT for offload is modeled as KV cache size ÷ storage read bandwidth. This assumes the storage system can sustain its rated bandwidth to a single request.
  • GPU TFLOPS is effective throughput after MFU, not peak. 100 TFLOPS for an H100 is probably a conservative estimate; real utilization will vary with batch size and sequence length and should be benchmarked.
  • Local NVMe bandwidth is the aggregate of all drives on the server. Whether a single request can actually saturate that depends on how good the KV cache software (SGLang, etc) is.
  • Everything assumes bf16; quantization halves (fp8/int8) or quarters (int4) the weight footprint, shifting the GPU minimum downward.
  • A single model replica is being served. With more replicas, throughput increases, but the effects of local NVMe partitioning also come into play (each replica only sees its own server’s NVMe), increasing the value of external storage.

Parameters

  • Context length: KV cache per request grows linearly; overflow threshold and TTFT both shift
  • Concurrency: Total KV cache grows linearly; CP GPU count steps up in multiples of tp_base (e.g., 8 GPUs per server)
  • HBM per GPU: Higher HBM delays the point where CP needs additional GPUs
  • GPU cost: Scales both approaches proportionally; doesn’t change the ratio unless storage cost is significant
  • GPU TFLOPS: Affects CP TTFT only; offload TTFT is I/O bandwidth-bound
  • Local NVMe per server: Sets the “free” (zero-dollar) storage capacity before external storage is needed for offload
  • Local NVMe bandwidth: Primary driver of offload TTFT
  • External storage bandwidth: Only matters after local NVMe runs out of capacity
  • External storage cost: Only affects cost after local NVMe overflows
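
The overflow threshold itself is a one-line computation under the page defaults (42 GB of KV per request, 16 TB of local NVMe per server):

```python
# Concurrency at which total KV spills past the "free" local NVMe tier.
KV_PER_REQUEST_GB = 42       # per 128K-token request
NVME_TB_PER_SERVER = 16

def overflow_threshold(num_servers=1):
    free_gb = num_servers * NVME_TB_PER_SERVER * 1000
    # Max concurrent sessions whose caches fit on local NVMe.
    return free_gb // KV_PER_REQUEST_GB

print(overflow_threshold())  # ~380 sessions before external storage is needed
```

At the default concurrency of 50 this is nowhere near overflowing, which is why the offload line in the cost plot stays at pure GPU cost for so long.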

Not modeled

This model ignores a bunch of things:

  • Decode latency: tokens per second after the first token are not reflected anywhere. The model covers prefill only; decode must access keys and values in HBM regardless of which approach loaded them there.
  • KV cache offload writes: only the read path is modeled; flushing KV cache from HBM to storage is not on the prefill critical path and is therefore not modeled.
  • KV cache compression/quantization: shrinks the KV cache per request, which would shift the overflow threshold and offload TTFT downward.
  • Prefix caching: Shared prefix caching across requests would increase KV cache hit rate in different tiers. Too complex to model without a specific workload trace.