
Activations

Intermediate tensor values computed during a model's forward pass. During inference they are held in VRAM only transiently and discarded once later layers have consumed them.

Definition

Activations are the output tensors of each layer (attention projections, MLP outputs, layer norms, etc.) as data flows through the forward pass. During training, activations must be kept for the backward pass (unless they are recomputed, as in activation checkpointing), so activation memory scales roughly with number of layers × batch size × sequence length × hidden size. During inference, activations are computed and consumed layer by layer: only the current layer's activations need to be in VRAM at any one time, so the activation footprint is much smaller than in training. It still grows linearly with batch size, however, so at very large batch sizes activation memory can be significant.
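
As a rough illustration of the scaling described above, the following Python sketch estimates activation memory under simplifying assumptions: every layer is assumed to produce the same number of activation values, training keeps every layer's activations live, and inference keeps only one layer's worth. The function name `activation_bytes`, the `tensors_per_layer` multiplier, and the model dimensions in the usage lines are all hypothetical and chosen only for illustration; real frameworks allocate memory differently.

```python
def activation_bytes(batch_size, seq_len, hidden_size, num_layers,
                     bytes_per_value=2, tensors_per_layer=1, training=False):
    """Back-of-the-envelope activation memory estimate, in bytes.

    Assumptions (not measurements):
      - each layer emits `tensors_per_layer` tensors of shape
        (batch_size, seq_len, hidden_size);
      - training keeps all layers' activations for the backward pass;
      - inference keeps only the current layer's activations.
    """
    per_layer = (batch_size * seq_len * hidden_size
                 * tensors_per_layer * bytes_per_value)
    live_layers = num_layers if training else 1
    return per_layer * live_layers


# Hypothetical model: 32 layers, hidden size 4096, 16-bit values.
inference = activation_bytes(8, 2048, 4096, 32)
training = activation_bytes(8, 2048, 4096, 32, training=True)

print(f"inference: {inference / 2**20:.0f} MiB")  # inference: 128 MiB
print(f"training:  {training / 2**20:.0f} MiB")   # training:  4096 MiB
```

Under these assumptions the training estimate is exactly `num_layers` times the inference estimate, and both double when the batch size doubles, which is why large inference batches can still consume a noticeable amount of VRAM.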
