Architecture

Context Window

The maximum number of tokens a model can attend to in a single forward pass, encompassing both the prompt and generated output.

Definition

The context window is the upper bound on how many tokens a model can process in one sequence, counting both the prompt and the generated output. Self-attention operates only over tokens inside the window: in decoder-only models with causal masking, each token attends to itself and the tokens before it, and anything outside the window is not attended to at all. Longer context windows allow richer prompts, multi-document reasoning, and longer generated sequences, but they come at a cost. The KV cache grows linearly with the number of tokens, and the compute for full self-attention grows quadratically. Models such as Gemini and Claude 3.5 support context windows on the order of 100K–1M tokens, and serving them efficiently requires careful memory management.
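The scaling described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a description of any particular model: the function names and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) are assumptions chosen only for illustration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_count(seq_len):
    """Query-key pairs per head per layer for full (unmasked) attention."""
    return seq_len * seq_len


# Hypothetical model shape, used only to illustrate the scaling.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    scores = attention_score_count(seq_len)
    print(f"{seq_len:>7} tokens: {gib:.1f} GiB KV cache, {scores:.2e} score entries")
```

Under these assumptions, going from 8,192 to 131,072 tokens (16 times longer) raises the KV cache from 1.0 GiB to 16.0 GiB, a 16-fold increase, while the number of attention score entries per head per layer grows 256-fold.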

Related terms: KV Cache, Chunked Prefill

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
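To illustrate the sharing, here is a minimal sketch of how query heads might be mapped onto K/V heads. It assumes contiguous grouping (consecutive query heads share one K/V head), which is a common convention but not the only possible layout. The function names are hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the K/V head used by a given query head.

    Assumes contiguous groups: query heads 0..g-1 share K/V head 0, and so on.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_ratio(num_query_heads, num_kv_heads):
    """KV cache size relative to MHA with the same number of query heads."""
    return num_kv_heads / num_query_heads


# 8 query heads sharing 2 K/V heads: two groups of four.
mapping = [kv_head_for_query_head(q, 8, 2) for q in range(8)]
print(mapping)                # [0, 0, 0, 0, 1, 1, 1, 1]
print(kv_cache_ratio(8, 2))   # 0.25
```

When the number of K/V heads equals the number of query heads, the mapping is the identity and the scheme reduces to MHA.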
