Architecture

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.

Definition

Grouped-Query Attention (GQA), introduced by Ainslie et al. (2023), sits between MHA and MQA on the quality-efficiency spectrum. The H query heads are divided into G groups, and the query heads in each group share a single key head and a single value head. The KV cache is therefore G/H the size of the MHA cache; G = H recovers MHA and G = 1 recovers MQA. Models such as Llama 3 and Mistral use GQA as their default attention mechanism because the smaller KV cache reduces memory bandwidth pressure during decoding with minimal quality loss.
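The head-sharing rule can be illustrated with a minimal, dependency-free sketch of a single decoding step. The function name `gqa_decode_step` and the nested-list layout are assumptions made for illustration, not any library's API; real implementations use batched tensor operations, and the mapping of consecutive query heads to one K/V head is one common convention.

```python
import math

def gqa_decode_step(q, k_cache, v_cache):
    """One decoding step of grouped-query attention (illustrative sketch).

    q:        H query vectors (one per query head), each of length d
    k_cache:  G cached key heads, each a list of T vectors of length d
    v_cache:  G cached value heads, same layout as k_cache
    Returns H output vectors of length d.
    """
    num_q_heads, num_kv_heads = len(q), len(k_cache)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("H must be divisible by G")
    group_size = num_q_heads // num_kv_heads

    outputs = []
    for h, q_h in enumerate(q):
        g = h // group_size  # query heads in the same group read the same K/V head
        keys, values = k_cache[g], v_cache[g]
        scale = 1.0 / math.sqrt(len(q_h))
        # Scaled dot-product score against every cached position.
        scores = [scale * sum(a * b for a, b in zip(q_h, k_t)) for k_t in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the cached value vectors.
        out = [sum(w * v_t[i] for w, v_t in zip(weights, values))
               for i in range(len(q_h))]
        outputs.append(out)
    return outputs

# H = 4 query heads, G = 2 K/V heads, T = 2 cached tokens, d = 2.
q = [[1.0, 0.0] for _ in range(4)]
k_cache = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
v_cache = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
out = gqa_decode_step(q, k_cache, v_cache)
```

Here heads 0 and 1 read K/V head 0 and heads 2 and 3 read K/V head 1, so only two K/V heads are cached instead of four.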

Multi-Head Attention (MHA), Multi-Query Attention (MQA), KV Cache

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
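As a rough sizing sketch, the cache holds one key tensor and one value tensor per layer, so its size scales linearly with the number of K/V heads. The helper `kv_cache_bytes` and the configuration below (32 layers, head dimension 128, 8192 tokens, 2-byte values) are illustrative assumptions, not the settings of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes (illustrative sketch)."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

GIB = 2 ** 30
# Hypothetical model with 32 query heads; only the K/V head count changes.
mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=8192)  # 4.0 GiB
gqa = kv_cache_bytes(32, num_kv_heads=8, head_dim=128, seq_len=8192)   # 1.0 GiB
mqa = kv_cache_bytes(32, num_kv_heads=1, head_dim=128, seq_len=8192)   # 0.125 GiB
print(mha / GIB, gqa / GIB, mqa / GIB)
```

Under these assumptions, going from 32 K/V heads to 8 shrinks the cache by the same factor of 4 that the G/H ratio predicts.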

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Multi-Query Attention (MQA)

Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.
