Continuous batching is an optimization where GPUs are continuously performing forward passes, but the batch they’re processing is dynamically constructed based on whatever tokens from whatever prompts are ready for the forward pass. I think continuous batching was introduced in 2022.¹
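The core idea is that the scheduler runs one forward pass at a time and rebuilds the batch between passes. Below is a minimal sketch of that loop in Python; it is not any particular serving engine’s implementation, and the Request fields, the single-step prefill, and the batch-size limit are simplifications made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    prompt_tokens: list                      # prompt tokens still waiting to be prefilled
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def forward_pass(batch):
    """Stand-in for one model forward pass over whatever is currently in the batch."""
    for req in batch:
        if req.prompt_tokens:                # prefill: consume the prompt (one step here, for simplicity)
            req.prompt_tokens = []
        else:                                # decode: emit one new token
            req.generated.append("<tok>")

def serve(incoming: deque, max_batch_size: int = 3) -> None:
    running = []
    while incoming or running:
        # Admit waiting requests whenever a slot is free -- no waiting for the batch to drain.
        while incoming and len(running) < max_batch_size:
            running.append(incoming.popleft())
        forward_pass(running)
        # Finished requests leave the batch immediately, freeing their slot for the next pass.
        running = [r for r in running if not r.done()]

serve(deque([Request("R1", list(range(100)), 4), Request("R2", list(range(10)), 2)]))
```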

Example

Let’s say we have a GPU node with the following requests to service:

Request   Prompt Type    Prefill Time   # output tokens
R1        long query     2 seconds      4
R2        short query    1 second       2
R3        short query    1 second       1

Static batching

  1. At t=0, all requests are prefilling.
  2. At t=1, R1 is still prefilling. R2 and R3 are idle.
  3. At t=2, all requests are decoding their first token.
  4. At t=3, R1 and R2 are decoding their second token. R3 is done and is padding.
  5. At t=4, R1 is decoding. R2 and R3 are both done and padding.
  6. At t=5, R1 is still decoding. R2 and R3 are still done.

The GPU starts to sit partially idle after step 3, as R3 is done, and by step 5 the GPU is only processing one request even though it’s capable of processing three.

Furthermore, a new request (R4) cannot begin until the above batch is fully complete (after step 6). This introduces an unnecessarily high time to first token for R4 if the request arrives after step 3, because the GPU does have idle capacity to take on a new request.
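To make the waste concrete, here is a toy simulation of the static-batching timeline above, with one tick standing in for one second of prefill or one decode step and the numbers hard-coded from the table (a pen-and-paper model, not a real scheduler):

```python
# Toy timeline for the static batch above. In static batching the whole batch moves in
# lockstep: decoding can't start until the slowest prefill (R1's) finishes, and finished
# requests sit as padding until every request in the batch is done.
requests = {"R1": {"prefill": 2, "tokens": 4},
            "R2": {"prefill": 1, "tokens": 2},
            "R3": {"prefill": 1, "tokens": 1}}

longest_prefill = max(r["prefill"] for r in requests.values())
batch_end = longest_prefill + max(r["tokens"] for r in requests.values())   # 6 ticks total

for t in range(batch_end):
    states = []
    for name, r in requests.items():
        if t < r["prefill"]:
            state = "prefill"
        elif t < longest_prefill:
            state = "idle"        # waiting for the slowest prefill in the batch
        elif t < longest_prefill + r["tokens"]:
            state = "decode"
        else:
            state = "padding"     # done, but stuck in the batch until everyone finishes
        states.append(f"{name}={state}")
    print(f"t={t}: " + ", ".join(states))
```

The printout reproduces the six steps above; the padding and idle states are exactly the slots a continuous scheduler could hand to a new request.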

Continuous batching

  1. At t=0, all requests are prefilling. R2 and R3 finish prefilling, so their K and V vectors are moved into a KV cache.
  2. At t=1, R1 is still prefilling. If another prefill request came in, the GPU could handle it using the idle slots freed up by R2 and R3.
  3. At t=2, all requests are decoding their first token. R1 goes straight from prefill to decode, while R2 and R3 may reload their prefilled K and V vectors from the KV cache.
  4. At t=3, R1 and R2 are decoding their second token. R3 is done and off the GPU.
  5. At t=4, R1 is decoding. R2 and R3 are both done and off the GPU.
  6. At t=5, R1 is decoding its last token. R2 and R3 are done.

If a new request R4 shows up at t=3,

  1. At t=3, R4 begins to prefill while R1 and R2 are decoding.
  2. At t=4, R1 is decoding its third token. R4 begins decoding its first token.
  3. At t=5, R1 is decoding its last token and R4 is decoding its second token.

This illustrates how continuous batching can increase GPU utilization and allow R4 to begin processing without waiting for the existing batch to complete, reducing time to first token.
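To put rough numbers on that, here is the TTFT arithmetic for R4 under both schemes, assuming (as the timeline above implies) a 1-second prefill for R4 and one decode step per second:

```python
r4_arrival, r4_prefill, decode_step = 3, 1, 1      # seconds; R4's prefill time is implied above

# Static batching: R4 can't start until the original batch drains at t=6 (end of step 6).
static_batch_drains_at = 6
ttft_static = (static_batch_drains_at - r4_arrival) + r4_prefill + decode_step   # 3 + 1 + 1 = 5 s

# Continuous batching: R4 starts prefilling at t=3 in the slot freed by R3.
ttft_continuous = r4_prefill + decode_step                                       # 1 + 1 = 2 s

print(f"TTFT for R4 -- static: {ttft_static}s, continuous: {ttft_continuous}s")
```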

Footnotes

  1. Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models,” OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu