Architecture

Decode Phase

The iterative, token-by-token generation phase that follows prefill. Each step appends one new key/value row to the KV cache and is typically memory-bandwidth-bound.

Definition

During the decode phase, the model generates one token at a time. Each step runs a forward pass over only the newest token, which attends to every earlier token (the prompt and all previously generated tokens) through the KV cache and produces the logits for the next token. Because a single token is processed per sequence per step, the matrix multiplications perform very little arithmetic for each byte they read, and the step time is dominated by streaming the model's weight matrices from HBM, plus the KV cache itself at long contexts or large batch sizes. This makes decode memory-bandwidth-bound rather than compute-bound. The per-request decode rate (output tokens per second) is the reciprocal of time-per-output-token (TPOT), so the time taken by each decode step directly sets the output rate.
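The bandwidth-bound claim can be made concrete with a rough estimate. The sketch below assumes each decode step reads every weight and the whole KV cache from HBM exactly once, and it ignores compute, kernel launch overhead, and batching. The function name and all numbers (7e9 parameters, 2 bytes per parameter, a 1 GB KV cache, 2 TB/s of HBM bandwidth) are hypothetical values chosen for illustration, not measurements of any particular model or GPU.

```python
def decode_step_time_lower_bound(n_params, bytes_per_param,
                                 kv_cache_bytes, hbm_bytes_per_s):
    """Bandwidth-only lower bound on one decode step, in seconds.

    Assumes every weight and the full KV cache are streamed from HBM
    once per step; compute time and other overheads are ignored.
    """
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return bytes_moved / hbm_bytes_per_s


# Hypothetical setup: 7e9 parameters at 2 bytes each, 1 GB of KV cache,
# and 2e12 bytes/s of HBM bandwidth.
tpot_s = decode_step_time_lower_bound(
    n_params=7e9,
    bytes_per_param=2,
    kv_cache_bytes=1e9,
    hbm_bytes_per_s=2e12,
)

# With one token per step, the per-request decode rate is 1 / TPOT.
tokens_per_s = 1.0 / tpot_s

print(f"TPOT lower bound: {tpot_s * 1e3:.2f} ms")      # 7.50 ms
print(f"Decode rate upper bound: {tokens_per_s:.1f} tokens/s")
```

Under these assumptions the estimate depends only on bytes moved and bandwidth, which is why reducing bytes per step (for example through weight quantization) or sharing one weight read across a batch of requests are the usual levers for decode performance.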
