Software

PagedAttention

vLLM's technique for storing the KV cache in non-contiguous memory blocks, nearly eliminating fragmentation and enabling larger effective batch sizes.

Definition

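PagedAttention, introduced by Kwon et al. (2023) and implemented in vLLM, borrows the concept of virtual memory paging from operating systems. Instead of reserving one large contiguous region of GPU memory per sequence, sized for the maximum possible output length, it divides the KV cache into fixed-size blocks, each holding the keys and values for a fixed number of tokens. Blocks can sit anywhere in VRAM; a per-sequence block table (the analogue of a page table) maps logical block positions to physical ones, and blocks are allocated on demand as generation proceeds. Because every block is the same size, external fragmentation disappears, and internal fragmentation is confined to the unfilled slots in each sequence's last block. Kwon et al. report that earlier serving systems used only about 20–40% of their KV cache memory for actual token state, with the rest lost to fragmentation and over-reservation. The result is higher memory utilisation, larger effective batch sizes, and, in the paper's evaluation, 2–4× higher throughput than prior systems at comparable latency.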
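The block-table idea can be illustrated with a small bookkeeping sketch. This is not vLLM's code or API: the names `BlockAllocator`, `Sequence`, `locate`, and `wasted_slots`, as well as the block size of 16 tokens, are assumptions chosen for illustration, and the sketch tracks only block ids rather than real GPU tensors.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hypothetical pool of physical block ids; any free block will do."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        # Freed blocks return to the pool and can serve any other sequence.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block, offset).
        block = self.block_table[token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE

    def wasted_slots(self) -> int:
        # Unused slots exist only in the final, partially filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(), Sequence()
for _ in range(20):
    a.append_token(allocator)
    b.append_token(allocator)  # interleaving scatters each sequence's blocks

print(a.block_table, b.block_table)  # physical blocks are not contiguous
print(a.locate(17))                  # (physical block, offset) of token 17
print(a.wasted_slots())              # at most BLOCK_SIZE - 1 per sequence
```

In this sketch each sequence wastes at most `BLOCK_SIZE - 1` slots regardless of how long it grows, whereas a contiguous reservation sized for the maximum output length wastes everything the sequence never reaches.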

Related: KV Cache · vLLM · Chapter 4: Software

More Software terms

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.

vLLM

Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.
