Continuous Batching
Scheduling technique that admits new requests into a running batch as soon as any sequence finishes, rather than waiting for the whole batch to drain, raising GPU utilisation compared to static batching.
Definition
Traditional static batching waits until every sequence in a batch has finished before starting the next batch, so the slots freed by short sequences sit idle until the longest sequence completes. Continuous batching (also called iteration-level scheduling or in-flight batching) makes the scheduling decision at every decoding iteration instead: finished sequences leave the batch and waiting requests take their slots. Because each step generates one token for every active sequence, a newly admitted request starts producing output as soon as its prompt has been processed, rather than waiting for the current batch to drain. The gain in GPU utilisation is largest when output lengths vary widely, which is typical in production LLM workloads.
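The difference can be illustrated with a toy simulation that counts decode iterations under each policy. This is only a sketch under simplifying assumptions: every sequence is reduced to its number of output tokens, prompt processing and KV-cache memory limits are ignored, and the function names and sample lengths are invented for illustration rather than taken from any serving library.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch runs for as long as its longest sequence.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when freed slots are refilled at every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active sequence emits one token.
        running = [remaining - 1 for remaining in running]
        # Retire finished sequences so their slots are free next iteration.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 2]
print(static_batching_steps(lengths, batch_size=4))      # 11
print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

With these made-up lengths, the 30 tokens need 11 iterations under static batching but only 8 under continuous batching, which is the minimum possible with four slots. When all sequences have the same length, the two policies take the same number of steps.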
Related
More Software terms
PagedAttention
vLLM's technique for storing the KV cache in fixed-size, non-contiguous memory blocks, which nearly eliminates fragmentation and enables larger effective batch sizes.
FlashAttention
IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by roughly 2–4× over standard implementations.
vLLM
Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.