Architecture

Multi-Head Attention (MHA)

Standard Transformer attention in which every attention head in a layer has its own Q, K, and V projections.

Definition

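Multi-Head Attention (MHA) runs self-attention in parallel across H independent heads, each with its own learned query, key, and value weight matrices, so each head can learn to attend to different relational patterns. Because every head stores its own keys and values, the KV cache per token scales linearly with the number of heads: each layer caches 2 × H × d_head values per token, where d_head is the per-head dimension. MHA has the most representational capacity of the three variants, but its large KV cache footprint at inference time has motivated more memory-efficient alternatives such as Grouped-Query Attention (GQA) and Multi-Query Attention (MQA), which share key and value heads across multiple query heads.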
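To make the linear scaling concrete, here is a minimal sketch that estimates the KV cache size per token. The function name `kv_cache_bytes_per_token` and the model configuration (32 layers, 32 query heads, head dimension 128, 2-byte fp16 values) are assumptions chosen for illustration and do not describe any particular model.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes cached per token: one key and one value vector per KV head, per layer.

    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not a requirement).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical configuration, for illustration only.
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32
HEAD_DIM = 128

# MHA: every query head has its own key/value head.
mha_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
# GQA: groups of query heads share a key/value head (8 KV heads assumed here).
gqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, 8, HEAD_DIM)
# MQA: all query heads share a single key/value head.
mqa_bytes = kv_cache_bytes_per_token(NUM_LAYERS, 1, HEAD_DIM)

for name, size in [("MHA", mha_bytes), ("GQA", gqa_bytes), ("MQA", mqa_bytes)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions the cache size is proportional to the number of key/value heads, so MHA needs 4 times the memory of the 8-head GQA setup and 32 times that of MQA. Multiply the per-token figure by sequence length and batch size to estimate the total cache.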
