PagedAttention
vLLM's technique for storing the KV cache in non-contiguous, fixed-size memory blocks, nearly eliminating fragmentation and enabling larger effective batch sizes.
Definition
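PagedAttention, introduced by Kwon et al. (2023) and implemented in vLLM, borrows the concept of virtual memory paging from operating systems. Instead of reserving one large contiguous chunk of GPU memory per sequence, sized for the maximum possible output length, it divides the KV cache into fixed-size blocks that can sit anywhere in VRAM and are located through a per-sequence block table (the analogue of a page table). Blocks are allocated on demand as tokens are generated, which removes external fragmentation and confines internal fragmentation to the last, partially filled block of each sequence. Kwon et al. report that earlier serving systems used only about 20–40% of their KV cache memory for actual token states, with the rest lost to fragmentation and over-reservation, whereas PagedAttention achieves near-zero waste. The result is higher memory utilisation, larger effective batch sizes, and significantly better throughput.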
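The block-table idea can be illustrated with a small toy model. The sketch below is not vLLM's code or API: the names BlockAllocator, Sequence, and BLOCK_SIZE are hypothetical, the block size of 16 tokens is an assumed value, and only the bookkeeping is modelled (no tensors or GPU memory are involved).

```python
BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


class BlockAllocator:
    """Hands out physical block ids from a shared pool (hypothetical helper)."""

    def __init__(self, num_blocks):
        # Reversed so that pop() returns the lowest free id first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # index = logical block, value = physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_idx):
        """Return (physical block id, offset within block) for a token."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def wasted_slots(self):
        # Internal fragmentation is limited to the final, partly filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens

    def free(self):
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

When two sequences grow in an interleaved fashion, each one's block table ends up pointing at non-adjacent physical blocks, yet lookups through locate stay cheap and at most BLOCK_SIZE - 1 slots per sequence are left unused.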
Related
More Software terms
FlashAttention
IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
vLLM
Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.