Software

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.

Definition

Traditional static batching waits until every sequence in a batch has finished before starting the next batch, so the batch slots of sequences that finish early sit idle until the longest one completes. Continuous batching (also called iteration-level scheduling or in-flight batching) makes the scheduling decision at every decoding iteration instead: finished sequences leave the batch and waiting requests take their slots. Because each step generates one new token for every active sequence, a newly admitted request starts contributing to throughput as soon as its prompt has been processed, rather than waiting for the whole batch to drain. The gain in GPU utilisation is largest when sequence lengths vary widely, which is typical in production LLM workloads.
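The difference between the two policies can be illustrated with a toy simulation. This is a minimal sketch, not how any particular serving engine implements scheduling: it counts decode iterations only, ignores prefill cost and memory limits, and assumes every request has a known positive output length. The function names `static_batching_steps` and `continuous_batching_steps` are hypothetical and chosen for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        # The whole batch is held until the longest member is done.
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when freed slots are refilled before every iteration."""
    waiting = deque(lengths)  # output lengths of requests not yet admitted
    active = []               # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode iteration: every active sequence emits one token.
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Finished sequences leave the batch and free their slots.
        active = [remaining for remaining in active if remaining > 0]
    return steps


# Illustrative output lengths (in tokens) with high variance.
lengths = [2, 8, 3, 8, 2, 3]
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
# prints: 19 13
```

With two slots, the static schedule needs 8 + 8 + 3 = 19 iterations because each short sequence waits for its longer partner, while the continuous schedule finishes in 13, which here equals the lower bound of 26 total tokens divided by 2 slots.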


More Software terms

PagedAttention

vLLM's technique for storing the KV cache in fixed-size, non-contiguous memory blocks (pages), which nearly eliminates fragmentation and enables larger effective batch sizes.

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

vLLM

Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.
