FlashAttention
IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by roughly 2–4× over a standard implementation.
Definition
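FlashAttention (Dao et al., 2022) restructures standard attention, whose intermediate score matrix grows as O(N²) in the sequence length N, into tiles that fit in a GPU's on-chip SRAM. By fusing the matmul, softmax, and dropout operations into a single kernel and computing the softmax incrementally across tiles, it never writes the intermediate N×N attention matrix to HBM. The extra memory needed beyond the inputs and outputs drops from O(N²) to O(N), and the number of HBM reads/writes falls substantially, although it does not become linear in N. The arithmetic is still O(N²) and the output is exact, not an approximation. The result is both faster wall-clock attention and lower peak memory usage. FlashAttention-2 improved parallelism and work partitioning across thread blocks and warps, and FlashAttention-3 targets Hopper GPUs such as the H100.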
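The tiling idea can be illustrated with a minimal pure-Python sketch. It is not the actual fused CUDA kernel: it tiles only over keys and values, omits dropout and masking, and the names `naive_attention`, `tiled_attention`, and `block` are hypothetical, chosen for illustration. What it shows is the running-max and running-sum rescaling that lets the softmax be computed one tile at a time without ever holding a full row of scores.

```python
import math

def naive_attention(Q, K, V):
    """Reference version: computes every score for a query before the softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    """Illustrative sketch: visits K/V in tiles of `block` rows per query.

    Per query it keeps only a running max (m), a running softmax
    denominator (l), and an unnormalised output accumulator (acc).
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")
        l = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first tile m is -inf, so the correction is exp(-inf) = 0.0.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

For any tile size, `tiled_attention` should agree with `naive_attention` up to floating-point rounding, which is the sense in which the tiled computation is exact and not an approximation.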
Related
More Software terms
PagedAttention
vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
vLLM
Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.