Skip to content
Architecture

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Definition

The KV cache stores the key and value projection tensors computed during the attention mechanism for every token that has already been processed. By reusing these tensors during autoregressive decoding, the model avoids recomputing the full self-attention over the growing context at every step. Each token generated adds a row to the cache, so memory usage grows linearly with sequence length and batch size. Efficient KV cache management is one of the central challenges in high-throughput LLM serving.

More Architecture terms