Prefix Caching
Reusing KV cache blocks computed for a shared prompt prefix across multiple requests, eliminating redundant prefill computation.
Definition
Prefix caching exploits the observation that many requests share a common prefix, such as a system prompt, few-shot examples, or a document context. By storing the KV cache blocks for that prefix and reusing them for subsequent requests, the engine skips prefill computation for the shared portion and only processes the tokens that follow it. vLLM and SGLang both implement automatic prefix caching. RadixAttention in SGLang generalises this by keeping cached sequences in a radix tree with LRU eviction, so a new request can reuse the longest prefix it shares with any cached sequence, rather than only one predefined prefix.
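The reuse logic can be illustrated with a minimal toy sketch. It is not the vLLM or SGLang implementation: `ToyPrefixCache`, `BLOCK_SIZE`, and the `"kv"` placeholder are hypothetical names chosen for illustration, and the sketch assumes that only full blocks are cached and that each block is keyed by a hash chained to the block before it, so a block can match only when the entire prefix before it matches too.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for the sketch)


class ToyPrefixCache:
    """Maps a chained block hash to a stand-in for a cached KV block."""

    def __init__(self) -> None:
        self.blocks: Dict[int, str] = {}

    def _block_keys(self, tokens: List[int]) -> List[int]:
        # Only full blocks are cacheable. Each key mixes in the parent key,
        # so identical tokens at a different position or after a different
        # prefix produce a different key.
        keys: List[int] = []
        parent = 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            parent = hash((parent, tuple(tokens[start:start + BLOCK_SIZE])))
            keys.append(parent)
        return keys

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens that need prefill)."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break  # first miss ends the shared prefix
            hits += 1
        for key in keys[hits:]:
            self.blocks[key] = "kv"  # placeholder for real KV tensors
        reused = hits * BLOCK_SIZE
        return reused, len(tokens) - reused


system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
cache = ToyPrefixCache()
first = cache.prefill(system_prompt + [100, 101, 102])
second = cache.prefill(system_prompt + [200, 201, 202, 203, 204])
print(first, second)
```

In this sketch the first request finds nothing cached and prefills all of its tokens, while the second reuses the two blocks covering the shared system prompt and prefills only its own suffix. A production engine would also need eviction (for example LRU) and would store real KV tensors rather than a placeholder string.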
Related
More Optimization terms
Quantization
Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
INT8
8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
FP8
8-bit floating-point format (E4M3 or E5M2) with native hardware support on NVIDIA H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.