
PagedAttention

vLLM's technique for storing the KV cache in fixed-size blocks that need not be contiguous in memory, removing most fragmentation and enabling larger effective batch sizes.

Definition

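PagedAttention, introduced by Kwon et al. (2023) and implemented in vLLM, borrows the concept of virtual memory paging from operating systems. Instead of reserving one large contiguous chunk of GPU memory per sequence, sized for the maximum possible length, it divides the KV cache into fixed-size blocks that can sit anywhere in VRAM and are located through a per-sequence page table (called a block table in vLLM). Blocks are allocated on demand as a sequence grows. Because every block is the same size, external fragmentation disappears, and internal fragmentation is confined to the last, partially filled block of each sequence. Kwon et al. report that in systems using contiguous pre-allocation only about 20–40% of KV cache memory actually holds token states, with the rest lost to over-reservation and fragmentation. The result is higher memory utilisation, larger effective batch sizes, and significantly better throughput.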
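
The page-table mapping described above can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's actual implementation or API: the class `BlockAllocator`, its methods, and the block size of 16 tokens are all hypothetical names and values chosen for illustration, and the sketch tracks only block bookkeeping, not the key/value tensors themselves.

```python
BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


class BlockAllocator:
    """Toy model: a free list of physical blocks plus one page table per sequence."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # The last block is full (or the sequence is new): allocate on demand.
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def lookup(self, seq_id, pos):
        """Translate a logical token position into (physical_block, offset)."""
        if not 0 <= pos < self.lengths.get(seq_id, 0):
            raise IndexError("token position out of range")
        return self.tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4)
for _ in range(17):            # 17 tokens need two blocks of 16
    alloc.append_token("a")
alloc.append_token("b")        # a second sequence takes any free block
```

In this sketch a sequence never holds more than `BLOCK_SIZE - 1` unused slots, and a released block can be reused by any other sequence regardless of where it sits in memory, which is the property the paging analogy is meant to capture.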
