Optimization

Chunked Prefill

Breaking large prefill requests into smaller token chunks so the GPU can interleave prefill and decode work, reducing stalls and inter-token latency for in-flight decode requests.

Definition

In continuous batching, a large prefill job can monopolise the GPU for tens of milliseconds or more, stalling all in-flight decode requests (a form of prefill-decode interference). Chunked prefill splits the prompt into chunks of C tokens (e.g., 512 or 1024) and processes one chunk per iteration, batching each prefill chunk with the pending decode steps. This bounds how long any single iteration can delay a decode request and so reduces tail (p99) inter-token latency; it also lets requests queued behind a long prompt start sooner, although the chunked request itself may see a slightly higher TTFT. vLLM (v0.4 and later) and SGLang support chunked prefill, either enabled by default or as a configurable option.
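The splitting step described above can be illustrated with a small, framework-independent sketch. It is a toy model, not how vLLM or SGLang implement scheduling: the function names `split_into_chunks` and `plan_iterations` are hypothetical, and it assumes every in-flight decode request contributes exactly one token per iteration.

```python
from typing import Dict, List, Tuple


def split_into_chunks(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) token spans that cover the prompt."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def plan_iterations(prompt_len: int, chunk_size: int, num_decodes: int) -> List[Dict]:
    """Toy schedule (an assumption, not a real engine's policy): each
    iteration runs one prefill chunk together with one decode token for
    every in-flight decode request."""
    plan = []
    for start, end in split_into_chunks(prompt_len, chunk_size):
        plan.append({
            "prefill_span": (start, end),
            "decode_tokens": num_decodes,
            # Total tokens processed in this iteration.
            "batch_tokens": (end - start) + num_decodes,
        })
    return plan


if __name__ == "__main__":
    # A 1300-token prompt with C = 512 and 8 in-flight decode requests.
    for step in plan_iterations(prompt_len=1300, chunk_size=512, num_decodes=8):
        print(step)
```

In this model no iteration processes more than C + (number of decodes) tokens, which is the sense in which chunking bounds the delay a decode request can experience; without chunking, the first iteration would carry all 1300 prompt tokens at once.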
