Optimization

RadixAttention

SGLang's automatic prefix caching: an LRU-managed radix tree over token sequences that lets any request reuse the KV cache of the longest prefix it shares with earlier requests, including prefixes that branch at different points.

Definition

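RadixAttention, introduced in SGLang, organises cached KV blocks as a radix tree (a compressed prefix tree) keyed by token sequences. When a new request arrives, the system finds the longest matching prefix in the tree, reuses the KV blocks for those tokens, and extends the tree with the unmatched remainder of the context. The cache is allowed to grow until it fills the available GPU memory; after that, the least recently used leaf nodes are evicted to make room. Because a token's KV entries depend on every token before it, only shared prefixes can be reused, not arbitrary sub-sequences. What the tree adds over single-prefix caching (such as one fixed system prompt) is that many overlapping prefixes are kept at once, so multi-turn conversations or few-shot prompts that diverge at different points each reuse as much as they share. Requests with long repeated prefixes skip most of their prefill computation, which can substantially reduce time-to-first-token.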
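The lookup, extension, and eviction steps described above can be sketched in a few lines of Python. This is a minimal illustration, not SGLang's implementation: the names `Node`, `RadixCache`, `match_prefix`, and `insert` are hypothetical, the cache only counts tokens instead of storing KV tensors, and it ignores details such as protecting nodes that in-flight requests are still using.

```python
import itertools


def _common_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Node:
    def __init__(self, tokens=(), parent=None):
        self.tokens = tuple(tokens)  # edge label: the tokens this node caches
        self.parent = parent
        self.children = {}           # first token of a child's edge -> child
        self.last_used = 0           # logical timestamp for LRU eviction


class RadixCache:
    """Toy radix tree; capacity is in tokens, standing in for GPU memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.size = 0
        self.root = Node()
        self._clock = itertools.count(1)

    def match_prefix(self, tokens):
        """Return how many leading tokens are cached, refreshing LRU stamps."""
        now = next(self._clock)
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.tokens, tokens[matched:])
            child.last_used = now
            matched += n
            if n < len(child.tokens):  # request diverges in the middle of an edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Add a token sequence, splitting edges where it diverges."""
        tokens = tuple(tokens)
        now = next(self._clock)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:  # nothing cached from here: add the remainder
                leaf = Node(tokens[i:], node)
                leaf.last_used = now
                node.children[tokens[i]] = leaf
                self.size += len(leaf.tokens)
                break
            n = _common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                child = self._split(child, n)
            child.last_used = now
            node, i = child, i + n
        self._evict()

    def _split(self, child, n):
        """Split child's edge after n tokens; return the new upper node."""
        mid = Node(child.tokens[:n], child.parent)
        mid.last_used = child.last_used
        child.parent.children[mid.tokens[0]] = mid
        child.tokens = child.tokens[n:]
        child.parent = mid
        mid.children[child.tokens[0]] = child
        return mid

    def _leaves(self):
        stack = [self.root]
        while stack:
            node = stack.pop()
            if node is not self.root and not node.children:
                yield node
            stack.extend(node.children.values())

    def _evict(self):
        """Drop least recently used leaves until the cache fits again."""
        while self.size > self.capacity:
            victim = min(self._leaves(), key=lambda n: n.last_used)
            del victim.parent.children[victim.tokens[0]]
            self.size -= len(victim.tokens)


cache = RadixCache(capacity=16)
system = [1, 2, 3, 4]                               # shared system prompt
cache.insert(system + [10, 11])                     # conversation A
print(cache.match_prefix(system + [20, 21]))        # 4: only the system prompt matches
cache.insert(system + [20, 21])                     # conversation B branches off
print(cache.match_prefix(system + [10, 11, 12]))    # 6: A's next turn reuses its history
```

Evicting leaves first keeps shared ancestors, such as the system prompt here, in the cache for as long as any branch below them survives.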
