Software

SGLang

LLM inference runtime from the LMSYS organisation (researchers at UC Berkeley and Stanford) featuring RadixAttention, speculative execution, and structured generation support.

Definition

SGLang (Structured Generation Language) is a high-performance LLM serving framework released through the LMSYS organisation by researchers at UC Berkeley and Stanford. Its distinguishing features include RadixAttention, which generalises prefix caching by letting any request reuse the KV cache of a previously seen token prefix; native support for structured generation and constrained decoding (JSON, regex); and an efficient runtime for programs that make multiple LLM calls. SGLang is particularly well suited to agent workflows that reuse context across calls, for example a long system prompt shared by every step. Published benchmarks report SGLang matching or exceeding vLLM throughput on many workloads, especially those with shared prefixes.
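The prefix-reuse idea can be illustrated with a minimal sketch. It uses a plain token-level trie rather than a compressed radix tree, stores no actual KV tensors, and is not SGLang's implementation; the class name `PrefixCache` and the token ids are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token id to a child node. A real radix tree would
    # compress runs of tokens into a single edge; this toy does not.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level trie standing in for a tree of cached KV entries."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record that KV entries for this token sequence now exist."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]               # hypothetical token ids
first_call = system_prompt + [10, 11]
second_call = system_prompt + [20, 21, 22]

cache.insert(first_call)
reused = cache.match_prefix(second_call)   # tokens whose KV can be reused
to_compute = len(second_call) - reused     # tokens that still need prefill
print(reused, to_compute)
```

In this sketch the second call only needs to process the tokens after the shared system prompt, which is where the throughput gain on shared-prefix workloads comes from.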

See also: RadixAttention; vLLM; Chapter 4: Software

More Software terms

PagedAttention

vLLM's technique for storing the KV cache in fixed-size, non-contiguous memory pages, which largely eliminates fragmentation and enables larger effective batch sizes.
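A rough sketch of the bookkeeping behind paged KV storage follows. It tracks only a block table per sequence, with no tensors or GPU memory, and is not vLLM's code; `PagedKVAllocator` and `BLOCK_SIZE` are illustrative names and values.

```python
BLOCK_SIZE = 4  # tokens per page; an illustrative value


class PagedKVAllocator:
    """Toy block-table allocator: each sequence owns a list of page ids."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:
            # The last page is full (or none exists yet): take any free page.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4)
for _ in range(4):
    alloc.append_token("a")      # fills sequence a's first page
alloc.append_token("b")          # sequence b takes the next free page
slot = alloc.append_token("a")   # a's fifth token lands on a non-adjacent page
print(alloc.block_tables["a"], slot)
```

Because a sequence only ever wastes the unused tail of its last page, memory that a contiguous allocator would have reserved up front stays available for other requests.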

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in on-chip SRAM, cutting HBM reads/writes and speeding up attention by roughly 2–4× over standard implementations.
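The tiling relies on an online softmax, which lets attention be accumulated one tile of keys at a time while still producing the exact result. The following is a pure-Python sketch for a single query vector, under simplifying assumptions: no GPU, no SRAM management, and no 1/sqrt(d) scaling. The function names are hypothetical and this is not the FlashAttention kernel.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialise all scores, then softmax, then weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values, tile=2)
```

The two outputs agree up to floating-point rounding, which is what "exact" means here: the tiled version never holds the full score vector, yet it is not an approximation.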

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
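The difference from static batching can be seen in a small simulation. This is a sketch under simplifying assumptions: each request needs a known number of decode steps (at least one), every step costs the same regardless of batch occupancy, and prefill is ignored. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lengths = [2, 8, 3, 1, 4, 2]  # decode steps needed per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With these lengths, static batching leaves a slot idle while the 8-step request finishes, whereas the continuous scheduler keeps both slots busy and completes in fewer steps.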
