Software

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

Definition

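FlashAttention (Dao et al., 2022) restructures the standard attention computation, which is O(N²) in sequence length N, into tiles that fit in a GPU's on-chip SRAM. By fusing the matmuls, softmax, and dropout into a single kernel and using an online softmax that rescales partial results as each tile is processed, it avoids writing the intermediate N×N attention matrix to HBM. The output is exact, not an approximation. The extra memory attention needs drops from O(N²) to O(N), and HBM reads/writes fall substantially, although they are not linear in N. The backward pass recomputes attention tiles from saved softmax statistics instead of storing the matrix. The result is both faster wall-clock attention and lower peak memory usage. FlashAttention-2 improved parallelism and work partitioning across thread blocks and warps, and FlashAttention-3 targets Hopper GPUs such as the H100, exploiting asynchronous Tensor Core and memory operations as well as FP8.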
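The tiling idea can be illustrated with a minimal NumPy sketch of the forward pass. It is not the actual CUDA kernel: it runs on the CPU, models no SRAM/HBM traffic, and omits dropout, masking, and the backward pass. The function names, the `block_size` parameter, and the loop order (outer loop over query blocks) are assumptions chosen for readability. What it does show is how a running row maximum and running denominator let softmax be computed tile by tile without ever holding the full N×N score matrix.

```python
import numpy as np


def reference_attention(q, k, v):
    """Standard attention: materialises the full N x N score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v


def tiled_attention(q, k, v, block_size=4):
    """Illustrative tiled attention with an online softmax.

    Only one (block_size x block_size) tile of scores exists at a time.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))

    for i in range(0, n, block_size):
        q_blk = q[i:i + block_size]
        rows = q_blk.shape[0]
        row_max = np.full(rows, -np.inf)      # running max of scores per row
        denom = np.zeros(rows)                # running softmax denominator
        acc = np.zeros((rows, v.shape[-1]))   # running unnormalised output

        for j in range(0, k.shape[0], block_size):
            k_blk = k[j:j + block_size]
            v_blk = v[j:j + block_size]
            s = (q_blk @ k_blk.T) * scale     # scores for this tile only

            new_max = np.maximum(row_max, s.max(axis=-1))
            # Rescale earlier partial sums to the new reference maximum.
            correction = np.exp(row_max - new_max)
            p = np.exp(s - new_max[:, None])

            denom = denom * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v_blk
            row_max = new_max

        out[i:i + block_size] = acc / denom[:, None]
    return out
```

Comparing `tiled_attention` against `reference_attention` on random inputs with `np.allclose` is a quick way to check that the tiled result matches the standard one up to floating-point error.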

Related: Memory Bandwidth, Arithmetic Intensity, Chapter 3: Hardware

More Software terms

PagedAttention

vLLM's technique for storing the KV cache in fixed-size blocks (pages) that need not be contiguous in memory, nearly eliminating fragmentation and enabling larger effective batch sizes.

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.

vLLM

Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.
