RadixAttention
SGLang's generalised prefix caching: an LRU-evicted radix tree lets requests reuse any cached KV prefix, including prefixes shared by branches that diverge at different points, not just one fixed common prefix.
Definition
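RadixAttention, introduced in SGLang, organises cached KV blocks as a radix (compressed prefix) tree keyed by token sequences. When a new request arrives, the system finds the longest prefix of its tokens that already exists in the tree, reuses the KV blocks for that prefix, and extends the tree with the remaining tokens. The tree grows until the memory reserved for the cache is full; after that, the least recently used leaf nodes are evicted to make room. Because the KV entries for a token depend on every token before it, reuse is always prefix-based. The generalisation over simple prefix caching is that the tree holds many prefixes at once and shares their common parts, so branching workloads (e.g., multi-turn conversations that diverge at different points) all hit the cache. Prefill is skipped for the matched tokens, which can substantially reduce time-to-first-token on requests with long repeated prefixes.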
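The lookup, extension, and eviction steps described above can be illustrated with a small sketch. This is not SGLang's implementation: the names `Node`, `RadixCache`, `match_prefix`, `insert`, and `capacity_tokens` are hypothetical, the cache tracks only token counts rather than real KV blocks, and it omits details such as protecting nodes that running requests still reference.

```python
import itertools


def common_len(a, b):
    """Length of the shared leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Node:
    def __init__(self, tokens=(), parent=None):
        self.tokens = tuple(tokens)  # edge label: tokens whose KV this node stands for
        self.parent = parent
        self.children = {}           # first token of a child's edge -> child Node
        self.last_used = 0           # logical timestamp for LRU eviction


class RadixCache:
    """Toy radix-tree prefix cache (illustrative only, not SGLang's code)."""

    def __init__(self, capacity_tokens):
        self.root = Node()
        self.capacity = capacity_tokens  # assumed budget, counted in tokens
        self.size = 0
        self._clock = itertools.count(1)

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV."""
        tokens = tuple(tokens)
        node, matched, now = self.root, 0, next(self._clock)
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = common_len(child.tokens, tokens[matched:])
            child.last_used = now
            matched += n
            if n < len(child.tokens):  # diverged partway along this edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Add a sequence, sharing any existing prefix, then evict if over budget."""
        tokens = tuple(tokens)
        node, i, now = self.root, 0, next(self._clock)
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                new = Node(tokens[i:], node)
                new.last_used = now
                node.children[tokens[i]] = new
                self.size += len(new.tokens)
                break
            n = common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                child = self._split(child, n)
            child.last_used = now
            node, i = child, i + n
        self._evict()

    def _split(self, child, n):
        """Split an edge so its first n tokens become a shared parent node."""
        upper = Node(child.tokens[:n], child.parent)
        upper.last_used = child.last_used
        child.parent.children[upper.tokens[0]] = upper
        child.tokens = child.tokens[n:]
        child.parent = upper
        upper.children[child.tokens[0]] = child
        return upper

    def _leaves(self):
        stack, leaves = [self.root], []
        while stack:
            node = stack.pop()
            if node.children:
                stack.extend(node.children.values())
            elif node is not self.root:
                leaves.append(node)
        return leaves

    def _evict(self):
        """Drop least recently used leaves until the cache fits its budget."""
        while self.size > self.capacity:
            victim = min(self._leaves(), key=lambda n: n.last_used)
            del victim.parent.children[victim.tokens[0]]
            self.size -= len(victim.tokens)
```

For example, after `cache.insert([1, 2, 3, 4, 5])`, a call to `cache.match_prefix([1, 2, 3, 9, 9])` reports three reusable tokens, and inserting that second sequence splits the original edge so both branches share the `[1, 2, 3]` node. Evicting only leaves keeps every remaining node a valid prefix of something still cached.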
Related
More Optimization terms
Quantization
Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
INT8
8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
FP8
8-bit floating-point format (E4M3 or E5M2) natively supported by the tensor cores on H100/H200 GPUs, enabling faster matmuls than FP16 with minimal accuracy loss.