Architecture

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Definition

The KV cache stores the key and value projection tensors computed during the attention mechanism for every token that has already been processed. By reusing these tensors during autoregressive decoding, the model avoids recomputing the full self-attention over the growing context at every step. Each token generated adds a row to the cache, so memory usage grows linearly with sequence length and batch size. Efficient KV cache management is one of the central challenges in high-throughput LLM serving.

PagedAttention Prefix Caching Chapter 4: Software

More Architecture terms

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.

Multi-Query Attention (MQA)

Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.

Back to Glossary Start Reading — Chapter 0