Architecture

Activations

Intermediate tensor values computed during a model's forward pass, held in VRAM transiently and discarded after each decode step.

Definition

Activations are the output tensors of each layer (attention projections, MLP outputs, layer norms, etc.) produced as data flows through the forward pass. During training, every layer's activations must be kept for the backward pass (unless they are recomputed, as with activation checkpointing), so activation memory scales with number of layers × batch size × sequence length × hidden size. During inference, activations are computed and consumed layer by layer: only the current layer's activations need to be resident in VRAM at any one time, so the activation footprint is far smaller than in training. It still grows with batch size, however, and can become significant at very large batch sizes.
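The scaling described above can be sketched with a rough back-of-the-envelope estimate. The model dimensions below (32 layers, hidden size 4096, 2-byte values) are hypothetical, and the function counts only one [batch, sequence, hidden] tensor per layer; real models hold several larger intermediates per layer (attention scores, MLP expansions), so treat the numbers as a lower bound for comparison, not a measurement.

```python
def activation_bytes(batch_size, seq_len, hidden_size, resident_layers, bytes_per_value=2):
    """Rough size of one [batch, seq, hidden] tensor per resident layer.

    Simplifying assumption: each layer contributes a single hidden-state-sized
    tensor. Actual per-layer activation memory is larger.
    """
    return batch_size * seq_len * hidden_size * bytes_per_value * resident_layers


# Hypothetical model and workload, chosen only for illustration.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
BATCH_SIZE = 8
SEQ_LEN = 2048

# Training: activations of all layers are kept for the backward pass.
training = activation_bytes(BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE, resident_layers=NUM_LAYERS)

# Inference prefill: the whole prompt is processed, but one layer at a time.
prefill = activation_bytes(BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE, resident_layers=1)

# Inference decode: one new token per sequence, one layer at a time.
decode = activation_bytes(BATCH_SIZE, 1, HIDDEN_SIZE, resident_layers=1)

MIB = 1024 ** 2
for name, size in [("training", training), ("prefill", prefill), ("decode", decode)]:
    print(f"{name:>8}: {size / MIB:,.3f} MiB")
```

Under these assumptions the training estimate is exactly NUM_LAYERS times the prefill estimate, and doubling BATCH_SIZE doubles all three figures, which is why activation memory can still matter for inference at very large batch sizes.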

Related terms: VRAM, Prefill Phase

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
