Architecture

Decode Phase

The iterative token-by-token generation phase that follows prefill, where each step extends the KV cache by one row and is memory-bandwidth-bound.

Definition

During the decode phase, the model generates one token at a time. Each step runs a forward pass over only the newest token, which attends to every earlier token (the prompt and all previously generated output) through the KV cache and produces the logits for the next token. Because a single token is processed per request per step, the matrix multiplications reduce to matrix-vector products that perform very little arithmetic per byte loaded. Step time is therefore dominated by streaming the model's weight matrices and the KV cache from HBM, which makes decode memory-bandwidth-bound rather than compute-bound. Per-request decode speed (output tokens per second) is the reciprocal of time-per-output-token (TPOT), so it directly sets the rate at which output appears.
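
The bandwidth-bound claim can be turned into a rough back-of-envelope estimate. The sketch below assumes a single request (batch size 1) and that every weight and the whole KV cache are read from HBM exactly once per step. The function name and all numbers are hypothetical, chosen only for illustration, and the result is a lower bound rather than a measured latency.

```python
def decode_step_lower_bound_s(param_count, bytes_per_param,
                              kv_cache_bytes, hbm_bandwidth_bytes_per_s):
    """Lower bound on one decode step, assuming weights and KV cache are each read once."""
    bytes_moved = param_count * bytes_per_param + kv_cache_bytes
    return bytes_moved / hbm_bandwidth_bytes_per_s


# Hypothetical values, not taken from any specific model or GPU.
params = 7e9           # 7B parameters
bytes_per_param = 2    # 16-bit weights
kv_bytes = 1e9         # 1 GB of cached keys/values for this request
bandwidth = 2e12       # 2 TB/s of HBM bandwidth

tpot_s = decode_step_lower_bound_s(params, bytes_per_param, kv_bytes, bandwidth)
tokens_per_s = 1.0 / tpot_s  # per-request decode speed is the reciprocal of TPOT
print(f"TPOT >= {tpot_s * 1e3:.1f} ms, at most {tokens_per_s:.0f} tokens/s per request")
```

With these assumed numbers the bound works out to 7.5 ms per token. Batching several requests into one step amortizes the weight reads across them, which is why the estimate applies only to the single-request case stated above.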

Related terms: Prefill Phase, Memory Bandwidth, KV Cache

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
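
As a minimal sketch of why caching avoids recomputation, the NumPy example below runs single-head attention two ways: once appending one key row and one value row per step, and once re-projecting the whole prefix at every step. The weights, dimensions, and helper name `attend` are made up for illustration, and the cache lives in host memory here rather than on a GPU.

```python
import numpy as np


def attend(q, K, V):
    """Scaled dot-product attention for one query vector over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])       # shape (t,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                         # shape (d,)


rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))  # stand-in embeddings for 5 tokens

# Cached decode: project each new token once and append one row to K and V.
# (np.vstack copies on every call; real implementations preallocate the buffer.)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_outputs.append(attend(x @ Wq, K_cache, V_cache))

# Reference: recompute K and V for the full prefix at every step.
recomputed_outputs = [
    attend(tokens[t] @ Wq, tokens[: t + 1] @ Wk, tokens[: t + 1] @ Wv)
    for t in range(len(tokens))
]

matches = np.allclose(cached_outputs, recomputed_outputs)
print("cached and recomputed outputs match:", matches)
```

Both paths give the same attention outputs, but the cached path performs one key/value projection per step while the reference path repeats the projection for every earlier token.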

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
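
To make the cache saving concrete, the sketch below compares KV cache size for one sequence under MHA and GQA. The model shape is hypothetical, and the formula assumes one key tensor and one value tensor per layer, stored at 2 bytes per element.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
layers, query_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)            # 8 KV heads, 4 query heads per group

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio {mha // gqa}x")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, so sharing each KV head across a group of four query heads cuts it to a quarter.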
