vLLM

Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.

Definition

vLLM is a high-throughput, memory-efficient LLM inference and serving system developed at UC Berkeley. Its two primary innovations are PagedAttention, which stores the KV cache in fixed-size blocks that need not be contiguous in GPU memory, and iteration-level continuous batching, which reschedules the batch at every decoding step so that new requests can join as soon as finished ones leave. vLLM supports a wide range of model architectures (including Llama, Mistral, Qwen, and Falcon), quantization formats (GPTQ, AWQ, FP8), and hardware backends (CUDA, ROCm, TPU). It exposes an OpenAI-compatible REST API and is one of the most widely deployed open-source inference runtimes in production.
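To make the paging idea concrete, here is a minimal toy sketch of the bookkeeping behind a paged KV cache. It is not vLLM's implementation: the class `ToyPagedKVCache`, its methods, and the block size of 4 tokens are all hypothetical names and values chosen for illustration. It only tracks which physical block each token would land in and stores no actual key or value tensors.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative value, not a vLLM default)


class ToyPagedKVCache:
    """Toy model of paged KV-cache bookkeeping (hypothetical, not vLLM code)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; allocate a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        # Physical location of the new token: (block id, offset within block)
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need 2 blocks
cache.append_token("b")      # a second sequence takes its own block
```

Because blocks are handed out on demand from a shared pool, a sequence wastes at most one partially filled block, and the blocks of a finished sequence can be reused immediately by requests that join the batch at a later decoding step.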
