Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
Definition
Grouped-Query Attention (GQA), introduced by Ainslie et al. (2023), sits between Multi-Head Attention (MHA) and Multi-Query Attention (MQA) on the quality-efficiency spectrum. The H query heads are divided into G groups, and all query heads in a group share a single key head and a single value head. Only G key/value heads are cached instead of H, so the KV cache is G/H the size of the MHA cache; setting G = H recovers MHA and G = 1 recovers MQA. Models such as Llama 3 and Mistral use GQA as their default attention mechanism because it substantially reduces KV cache size and memory-bandwidth pressure during decoding, with little quality loss.
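The grouping and the G/H cache ratio can be illustrated with a minimal pure-Python sketch of a single decoding step. This is not taken from any particular model's implementation: the function names (kv_head_for, gqa_decode_step, kv_cache_bytes), the contiguous assignment of query heads to groups, and the 2-byte element size are assumptions made for illustration.

```python
import math


def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared K/V head used by a query head.

    Assumes contiguous grouping: with H=8 and G=2, query heads 0-3 use
    K/V head 0 and query heads 4-7 use K/V head 1.
    """
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("H must be divisible by G")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def gqa_decode_step(q, k_cache, v_cache):
    """Attention output for one new token.

    q:        H query vectors, each of length d
    k_cache:  G lists of T cached key vectors, each of length d
    v_cache:  G lists of T cached value vectors, each of length d
    Returns H output vectors of length d.
    """
    n_query_heads, n_kv_heads = len(q), len(k_cache)
    outputs = []
    for h, q_h in enumerate(q):
        g = kv_head_for(h, n_query_heads, n_kv_heads)
        d = len(q_h)
        # Scaled dot product of this query with every cached key of its group.
        scores = [
            sum(a * b for a, b in zip(q_h, k_t)) / math.sqrt(d)
            for k_t in k_cache[g]
        ]
        # Numerically stable softmax over the cached positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the group's cached values.
        outputs.append([
            sum(w * v_t[i] for w, v_t in zip(weights, v_cache[g]))
            for i in range(d)
        ])
    return outputs


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V in every layer (leading 2 = K plus V).

    bytes_per_elem=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Illustrative configuration: 32 layers, H=32, G=8, head_dim=128, 8192 tokens.
mha_bytes = kv_cache_bytes(32, 32, 128, 8192)  # one K/V head per query head
gqa_bytes = kv_cache_bytes(32, 8, 128, 8192)   # 8 shared K/V heads
cache_ratio = gqa_bytes / mha_bytes            # G / H = 8 / 32
```

Under these assumed settings the MHA cache comes to 4 GiB and the GQA cache to 1 GiB per sequence, matching the G/H = 1/4 ratio. Passing a cache with G = H to gqa_decode_step gives ordinary multi-head attention, and a cache with G = 1 gives multi-query attention.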
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Multi-Query Attention (MQA)
Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.