Software

vLLM

Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.

Definition

vLLM is a high-throughput and memory-efficient LLM serving system developed at UC Berkeley. Its primary innovations are PagedAttention (non-contiguous KV cache paging) and iteration-level continuous batching. vLLM supports a wide range of model architectures (Llama, Mistral, Qwen, Falcon, etc.), quantization formats (GPTQ, AWQ, FP8), and hardware backends (CUDA, ROCm, TPU). It exposes an OpenAI-compatible REST API and is one of the most widely deployed open-source inference runtimes in production.

PagedAttention SGLang TensorRT-LLM

More Software terms

PagedAttention

vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.

Back to Glossary Start Reading — Chapter 0