Multi-Query Attention (MQA)

Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.

Definition

Multi-Query Attention (MQA), proposed by Shazeer (2019), takes memory efficiency to the limit: all query heads share a single key head and a single value head. This reduces the KV cache to 1/H the size of standard multi-head attention (MHA), where H is the number of query heads. Because autoregressive decoding is typically memory-bandwidth-bound, the smaller cache speeds up decoding considerably, but the reduced representational capacity can hurt quality on complex tasks. Grouped-Query Attention (GQA), which shares each K/V head among a group of query heads rather than all of them, is now generally preferred because it recovers most of the quality while keeping most of the memory savings.
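
The 1/H reduction can be checked with a little arithmetic. The sketch below is illustrative only: the function name kv_cache_bytes and the model configuration (32 layers, 32 query heads, head dimension 128, 4096-token context, 2-byte elements) are assumptions chosen for the example, not values taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed configuration, for illustration only.
H = 32  # number of query heads
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=H, **cfg)  # MHA: one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)  # GQA: 8 K/V heads shared in groups
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)  # MQA: a single shared K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks from 2048 MiB (MHA) to 64 MiB (MQA), exactly a factor of H = 32, while GQA with 8 K/V heads sits in between at 512 MiB.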