Optimization

Chunked Prefill

Breaking large prefill requests into smaller token chunks so the GPU can interleave prefill and decode work, reducing stalls (inter-token latency) for in-flight decode requests.

Definition

In continuous batching, a large prefill job can monopolise the GPU for tens of milliseconds, stalling all in-flight decode requests (a phenomenon called prefill preemption). Chunked prefill splits the prompt into chunks of C tokens (e.g., 512 or 1024) and processes one chunk per iteration, interleaving prefill chunks with decode steps. This bounds the time any single iteration can take, which reduces tail inter-token latency for in-flight decodes and can reduce p99 TTFT for short requests that would otherwise queue behind a long prompt. vLLM v0.4+ and SGLang support chunked prefill as a default or configurable option.
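The chunking step can be illustrated with a minimal sketch. This is a toy planner, not the scheduler of vLLM, SGLang, or any other engine: the names `split_into_chunks` and `plan_iterations` are hypothetical, and it assumes each iteration runs exactly one prefill chunk together with one decode token per in-flight request.

```python
def split_into_chunks(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) token ranges that cover the prompt."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def plan_iterations(prompt_len: int, chunk_size: int, num_decodes: int) -> list[dict]:
    """Toy plan: one prefill chunk per iteration, batched with the pending decodes.

    Assumption (illustrative only): every in-flight decode request
    contributes exactly one token to every iteration.
    """
    return [
        {
            "prefill_range": (start, end),
            "prefill_tokens": end - start,
            "decode_tokens": num_decodes,
        }
        for start, end in split_into_chunks(prompt_len, chunk_size)
    ]


# A 1300-token prompt, C = 512, with 4 decode requests already in flight.
plan = plan_iterations(prompt_len=1300, chunk_size=512, num_decodes=4)
for step, iteration in enumerate(plan):
    print(step, iteration)
```

With these numbers the prompt is processed in three iterations, and no iteration handles more than 516 tokens (512 prefill plus 4 decode). Without chunking, the same prompt would occupy a single iteration of 1300 prefill tokens while the decodes wait.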

Related

Continuous Batching, Prefill Phase, TTFT

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

