Architecture

Context Window

The maximum number of tokens a model can attend to in a single forward pass, encompassing both the prompt and generated output.

Definition

The context window sets the upper bound on how many tokens a model processes at once. Tokens inside the window are visible to self-attention (in causal decoder models, each token attends only to itself and the tokens before it), while tokens outside the window are not attended to at all. Longer context windows allow richer prompts, multi-document reasoning, and longer generated sequences, but they come at a cost: the KV cache grows linearly with sequence length, and the compute cost of full self-attention grows quadratically. Models such as Gemini and Claude 3.5 support context windows on the order of 100K–1M tokens, which requires careful memory management to serve efficiently.
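To make the scaling concrete, here is a rough back-of-the-envelope sketch. The function names and the model configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are hypothetical, chosen only for illustration and not taken from any particular model. It assumes a standard transformer that caches one key and one value vector per token, per layer, per KV head, with no cache compression or quantization.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    Each layer stores two tensors (keys and values), each of shape
    [seq_len, n_kv_heads, head_dim]. Size grows linearly in seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_entries(seq_len, n_layers, n_heads):
    """Number of query-key score entries for full (unmasked) self-attention.

    Each head in each layer compares every token with every token,
    so the count grows with seq_len squared.
    """
    return n_layers * n_heads * seq_len * seq_len


# Hypothetical configuration, for illustration only.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>7} tokens: KV cache ~ {gib:.0f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a 16x longer context needs a 16x larger cache, while the attention score count grows by 256x. Techniques such as grouped-query attention or cache quantization change the constants but not the linear and quadratic trends.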
