Continuous batching is an optimization where GPUs are continuously performing forward passes, but the batch they’re processing is dynamically constructed based on whatever tokens from whatever prompts are ready for the forward pass. I think continuous batching was introduced in 2022.¹
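The core idea is that the scheduler runs one forward pass at a time and rebuilds the batch between passes. Below is a minimal sketch of that loop in Python; it is not any particular serving engine’s implementation, and the Request fields, the single-step prefill, and the batch-size limit are simplifications made up for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    prompt_tokens: list                      # prompt tokens still waiting to be prefilled
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def forward_pass(batch):
    """Stand-in for one model forward pass over whatever is currently in the batch."""
    for req in batch:
        if req.prompt_tokens:                # prefill: consume the prompt (one step here, for simplicity)
            req.prompt_tokens = []
        else:                                # decode: emit one new token
            req.generated.append("<tok>")

def serve(incoming: deque, max_batch_size: int = 3) -> None:
    running = []
    while incoming or running:
        # Admit waiting requests whenever a slot is free -- no waiting for the batch to drain.
        while incoming and len(running) < max_batch_size:
            running.append(incoming.popleft())
        forward_pass(running)
        # Finished requests leave the batch immediately, freeing their slot for the next pass.
        running = [r for r in running if not r.done()]

serve(deque([Request("R1", list(range(100)), 4), Request("R2", list(range(10)), 2)]))
```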

Example

Let’s say we have a GPU node with the following requests to service:

Request   Prompt Type    Prefill Time   # output tokens
R1        long query     2 seconds      4
R2        short query    1 second       2
R3        short query    1 second       1

Static batching

  1. At t=0, all requests are prefilling.
  2. At t=1, R1 is still prefilling. R2 and R3 are idle.
  3. At t=2, all requests are decoding their first token.
  4. At t=3, R1 and R2 are decoding their second token. R3 is done and is padding.
  5. At t=4, R1 is decoding. R2 and R3 are both done and padding.
  6. At t=5, R1 is still decoding. R2 and R3 are still done.

The GPU starts to sit partially idle after step 3, as R3 is done, and by step 5 the GPU is only processing one request even though it’s capable of processing three.

Furthermore, a new request (R4) cannot begin until the above batch is fully complete (after step 6). This introduces an unnecessarily high time to first token for R4 if the request arrives after step 3, because the GPU does have idle capacity to take on a new request.
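To make the waste concrete, here is a toy simulation of the static-batching timeline above, with one tick standing in for one second of prefill or one decode step and the numbers hard-coded from the table (a pen-and-paper model, not a real scheduler):

```python
# Toy timeline for the static batch above. In static batching the whole batch moves in
# lockstep: decoding can't start until the slowest prefill (R1's) finishes, and finished
# requests sit as padding until every request in the batch is done.
requests = {"R1": {"prefill": 2, "tokens": 4},
            "R2": {"prefill": 1, "tokens": 2},
            "R3": {"prefill": 1, "tokens": 1}}

longest_prefill = max(r["prefill"] for r in requests.values())
batch_end = longest_prefill + max(r["tokens"] for r in requests.values())   # 6 ticks total

for t in range(batch_end):
    states = []
    for name, r in requests.items():
        if t < r["prefill"]:
            state = "prefill"
        elif t < longest_prefill:
            state = "idle"        # waiting for the slowest prefill in the batch
        elif t < longest_prefill + r["tokens"]:
            state = "decode"
        else:
            state = "padding"     # done, but stuck in the batch until everyone finishes
        states.append(f"{name}={state}")
    print(f"t={t}: " + ", ".join(states))
```

The printout reproduces the six steps above; the padding and idle states are exactly the slots a continuous scheduler could hand to a new request.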

Continuous batching

  1. At t=0, all requests are prefilling. R2 and R3 finish prefilling, so their K and V vectors are moved into a KV cache.
  2. At t=1, R1 is still prefilling. If another prefill request came in, the GPU could handle it using the idle slots freed up by R2 and R3.
  3. At t=2, all requests are decoding their first token. R1 goes straight from prefill to decode, while R2 and R3 may reload their prefilled K and V vectors from the KV cache.
  4. At t=3, R1 and R2 are decoding their second token. R3 is done and off the GPU.
  5. At t=4, R1 is decoding. R2 and R3 are both done and off the GPU.
  6. At t=5, R1 is decoding its last token. R2 and R3 are done.

If a new request R4 shows up at t=3,

  1. At t=3, R4 begins to prefill while R1 and R2 are decoding.
  2. At t=4, R1 is decoding its third token. R4 begins decoding its first token.
  3. At t=5, R1 is decoding its last token and R4 is decoding its second token.

This illustrates how continuous batching can increase GPU utilization and allow R4 to begin processing without waiting for the existing batch to complete, reducing time to first token.
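To put rough numbers on that, here is the TTFT arithmetic for R4 under both schemes, assuming (as the timeline above implies) a 1-second prefill for R4 and one decode step per second:

```python
r4_arrival, r4_prefill, decode_step = 3, 1, 1      # seconds; R4's prefill time is implied above

# Static batching: R4 can't start until the original batch drains at t=6 (end of step 6).
static_batch_drains_at = 6
ttft_static = (static_batch_drains_at - r4_arrival) + r4_prefill + decode_step   # 3 + 1 + 1 = 5 s

# Continuous batching: R4 starts prefilling at t=3 in the slot freed by R3.
ttft_continuous = r4_prefill + decode_step                                       # 1 + 1 = 2 s

print(f"TTFT for R4 -- static: {ttft_static}s, continuous: {ttft_continuous}s")
```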

Footnotes

  1. Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models,” OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu