
FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

Definition

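FlashAttention (Dao et al., 2022) computes exact attention without ever materializing the full N×N score matrix in GPU high-bandwidth memory (HBM). It splits the queries, keys, and values into tiles that fit in on-chip SRAM and fuses the matmuls, softmax, and dropout into a single kernel, using an online softmax that rescales partial results as each tile is processed. The arithmetic cost stays O(N²) in sequence length N, but the extra memory needed beyond the inputs and output drops from O(N²) to O(N), and HBM accesses fall from Θ(Nd + N²) to Θ(N²d²/M) for head dimension d and SRAM size M. This results in both faster wall-clock attention and lower peak memory usage. FlashAttention-2 improved parallelism and work partitioning across thread blocks and warps, and FlashAttention-3 targets Hopper GPUs such as the H100, exploiting asynchronous Tensor Core and memory operations and FP8 precision.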
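The tiling idea can be illustrated with a minimal NumPy sketch. This is not the actual CUDA kernel: it runs on the CPU, tiles only over keys and values (a real kernel also tiles the queries and keeps each tile in SRAM), and omits dropout and masking. The function names and the `block_size` parameter are hypothetical, chosen for illustration. What it does show is the online softmax: a running row maximum and row sum let each key/value tile be folded into the output without ever holding the full N×N matrix.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference: materializes the full N x N score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block_size=16):
    """Illustrative sketch: exact attention computed one key/value tile at a time.

    Assumption: 2-D inputs of shape (N, d); no masking, no dropout.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))   # unnormalized running output
    row_max = np.full(n, -np.inf)     # running max of scores per query row
    row_sum = np.zeros(n)             # running softmax denominator per row

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        s = (q @ k_blk.T) * scale                  # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)     # rescale earlier partial results
        p = np.exp(s - new_max[:, None])

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]                  # apply the normalization once
```

Because the rescaling is exact, the tiled result should match the naive one up to floating-point rounding for any block size, which is why FlashAttention is described as an exact rather than approximate method.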
