Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.

Definition

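Grouped-Query Attention (GQA), introduced by Ainslie et al. (2023), lies between multi-head attention (MHA) and multi-query attention (MQA) on the quality-efficiency spectrum. The H query heads are divided into G groups, and all query heads in a group share one key head and one value head. Setting G = H recovers MHA, and setting G = 1 recovers MQA. Because only G key/value heads are stored per layer instead of H, the KV cache is G/H the size of the MHA cache; with H = 32 and G = 8, for example, it is one quarter the size. Models such as Llama 3 and Mistral use GQA as their default attention mechanism: a smaller KV cache means less data has to be read from memory at each decoding step, which eases memory bandwidth pressure with minimal quality loss.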
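The cache-size relationship can be illustrated with a small sketch. The function names, the contiguous head-to-group mapping, and the configuration values (32 layers, head dimension 128, 8192 tokens, 2 bytes per element) are assumptions chosen for illustration, not taken from any particular model or library.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the shared K/V head used by a given query head.

    Assumes contiguous grouping: the first H/G query heads use K/V head 0,
    the next H/G use K/V head 1, and so on. This is a common convention,
    but it is an assumption of this sketch.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element=2 assumes a 16-bit dtype such as fp16 or bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical configuration, for illustration only.
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # G = H
gqa_bytes = kv_cache_bytes(NUM_LAYERS, 8, HEAD_DIM, SEQ_LEN)                # G = 8
mqa_bytes = kv_cache_bytes(NUM_LAYERS, 1, HEAD_DIM, SEQ_LEN)                # G = 1

print(f"MHA: {mha_bytes / 2**30:.3f} GiB")
print(f"GQA: {gqa_bytes / 2**30:.3f} GiB (ratio {gqa_bytes / mha_bytes})")
print(f"MQA: {mqa_bytes / 2**30:.3f} GiB")
```

Under these assumptions the ratio of the GQA cache to the MHA cache is exactly G/H, since the number of K/V heads is the only term that changes between the two calls.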