Multi-Head Attention (MHA)
Standard Transformer attention in which every layer maintains separate query (Q), key (K), and value (V) projections for each attention head.
Definition
Multi-Head Attention (MHA) runs self-attention in parallel across H independent heads, each learning to attend to different relational patterns. Each head has its own learned query, key, and value weight matrices, so every head produces its own key and value vectors for each token. The KV cache therefore stores 2 × H × d_head values per token per layer (d_head being the per-head dimension) and scales linearly with the number of heads. MHA offers the most representational capacity of the attention variants listed here, but its large KV cache footprint at inference time has motivated more memory-efficient variants such as GQA and MQA.
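To make the linear scaling concrete, here is a rough sketch of the per-token KV cache size. The function name `kv_cache_bytes_per_token` and the model configuration (32 layers, 32 heads, head dimension 128, 2-byte fp16 elements) are illustrative assumptions, not figures for any particular model.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed for one token across all layers.

    The factor 2 accounts for storing both a key and a value vector
    per KV head. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32
HEAD_DIM = 128

# MHA: one KV head per query head.
mha = kv_cache_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
# GQA: assume 8 KV heads shared by the 32 query heads.
gqa = kv_cache_bytes_per_token(NUM_LAYERS, 8, HEAD_DIM)
# MQA: a single KV head shared by all query heads.
mqa = kv_cache_bytes_per_token(NUM_LAYERS, 1, HEAD_DIM)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size // 1024} KiB per token")
```

Under these assumptions, MHA needs 512 KiB per token, against 128 KiB for GQA with 8 KV heads and 16 KiB for MQA. Multiplying by context length and batch size gives the total cache footprint.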
Related
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
Multi-Query Attention (MQA)
Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.
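The K/V sharing that separates these variants can be sketched as a mapping from each query head to the K/V head it reads. The helper name `kv_head_for_query_head` is hypothetical, and assigning contiguous blocks of query heads to each K/V head is an assumed convention. Real implementations may lay out groups differently.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the K/V head that a given query head attends with.

    Assumes num_query_heads is divisible by num_kv_heads and that query
    heads are grouped contiguously (an illustrative convention).
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


NUM_QUERY_HEADS = 8  # small hypothetical head count for readability

# MHA: every query head has its own K/V head.
mha_map = [kv_head_for_query_head(h, NUM_QUERY_HEADS, 8) for h in range(NUM_QUERY_HEADS)]
# GQA: here, 2 K/V heads, each shared by a group of 4 query heads.
gqa_map = [kv_head_for_query_head(h, NUM_QUERY_HEADS, 2) for h in range(NUM_QUERY_HEADS)]
# MQA: one K/V head shared by all query heads.
mqa_map = [kv_head_for_query_head(h, NUM_QUERY_HEADS, 1) for h in range(NUM_QUERY_HEADS)]

print("MHA:", mha_map)
print("GQA:", gqa_map)
print("MQA:", mqa_map)
```

Seen this way, MHA and MQA are the two endpoints of GQA: setting the number of K/V heads equal to the number of query heads gives MHA, and setting it to one gives MQA.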