vLLM
Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.
Definition
vLLM is a high-throughput, memory-efficient LLM inference and serving system developed at UC Berkeley. Its central contribution is PagedAttention, which stores the KV cache in fixed-size blocks that need not be contiguous in GPU memory. It pairs this with iteration-level continuous batching, where the scheduler revisits the batch after every decoding step. vLLM supports a wide range of model architectures (Llama, Mistral, Qwen, Falcon, etc.), quantization formats (GPTQ, AWQ, FP8), and hardware backends (CUDA, ROCm, TPU). It exposes an OpenAI-compatible REST API and is one of the most widely deployed open-source inference runtimes in production.
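The paging idea can be illustrated with a toy block-table allocator. This is a minimal sketch of the bookkeeping only, not vLLM's implementation: the class `PagedKVAllocator`, its methods, and the block sizes are hypothetical names chosen for illustration, and no tensors or attention kernels are involved.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence's logical KV blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # Pool of free physical block ids; any free block can serve any sequence.
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id, token_idx):
        """Translate a logical token index into (physical block, offset in block)."""
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=4)
for _ in range(5):
    alloc.append_token(seq_id=0)  # 5 tokens need 2 blocks
for _ in range(3):
    alloc.append_token(seq_id=1)  # 3 tokens need 1 block

print(alloc.block_tables)         # blocks of one sequence need not be adjacent
print(alloc.physical_slot(0, 4))  # where the fifth token of sequence 0 lives
```

Because blocks are allocated one at a time as a sequence grows, the only unused space per sequence is the tail of its last block, and freed blocks can be reused immediately by any other sequence.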
Related
PagedAttention
vLLM's technique for storing the KV cache in fixed-size, non-contiguous memory pages, which nearly eliminates fragmentation and enables larger effective batch sizes.
FlashAttention
IO-aware exact attention algorithm that tiles the computation so each block fits in on-chip SRAM, cutting HBM reads/writes and speeding up attention by roughly 2–4× over a standard implementation.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
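The difference between static and continuous batching can be made concrete with a small simulation. This is a sketch under simplifying assumptions, not vLLM's scheduler: each request is reduced to the number of tokens it still has to generate (at least 1), every decode step costs the same, and prefill is ignored. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes, then the next batch starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decode step, finished sequences leave and waiting ones take their slots."""
    waiting = deque(lengths)
    running = []  # tokens remaining for each in-flight sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running sequence emits one token,
        # and sequences that reach zero remaining tokens are dropped.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 3, 2, 2]  # tokens to generate per request (illustrative values)
print(static_batching_steps(lengths, batch_size=2))      # 13
print(continuous_batching_steps(lengths, batch_size=2))  # 10
```

In this example the static scheduler leaves a slot idle while the 8-token request finishes, whereas the continuous scheduler refills that slot immediately and reaches the lower bound of 20 tokens over 2 slots.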