Multi-Query Attention (MQA)
Extreme attention variant in which all query heads share a single K/V head, minimising KV cache size at the cost of some model quality.
Definition
Multi-Query Attention (MQA), proposed by Shazeer (2019), pushes K/V sharing to its limit: all query heads share a single key head and a single value head. With H query heads, this reduces the KV cache to 1/H of its size under MHA. Because autoregressive decoding is typically memory-bandwidth-bound, the smaller cache speeds up decoding considerably, but the reduced representational capacity can hurt quality on complex tasks. GQA is now generally preferred because it recovers most of the quality while keeping most of the memory savings.
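The definition above can be illustrated with a minimal NumPy sketch. It is not taken from Shazeer (2019) or from any particular library: the function names `kv_cache_bytes` and `mqa_decode_step`, and the example configuration (32 layers, 32 query heads, head dimension 128, 4096 cached tokens, 2 bytes per element), are assumptions chosen only for illustration.

```python
import numpy as np


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size in bytes: keys plus values (factor 2) for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def mqa_decode_step(q, k_cache, v_cache):
    """One decoding step of multi-query attention for a single sequence.

    q:       (H, d) one query vector per head for the newest token
    k_cache: (T, d) the single key head shared by all H query heads
    v_cache: (T, d) the single value head shared by all H query heads
    returns: (H, d) one attention output per query head
    """
    d = q.shape[-1]
    scores = q @ k_cache.T / np.sqrt(d)                  # (H, T): every head reads the same keys
    scores = scores - scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over cached positions
    return weights @ v_cache                             # (H, d): every head reads the same values


# Hypothetical configuration, assumed for illustration only.
H = 32
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=H, head_dim=128, seq_len=4096)
mqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
ratio = mha_bytes // mqa_bytes  # equals H, matching the 1/H reduction in the definition

rng = np.random.default_rng(0)
out = mqa_decode_step(
    rng.standard_normal((H, 128)),
    rng.standard_normal((16, 128)),
    rng.standard_normal((16, 128)),
)
```

Under these assumed settings the MHA cache is 2 GiB per sequence and the MQA cache is 64 MiB. In the decode step, only the query projection differs per head; the cached keys and values are read once and reused by all heads, which is where the bandwidth saving comes from.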
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.