Architecture

Prefill Phase

The initial forward pass that processes the full input prompt in parallel, producing the first output token and populating the KV cache.

Definition

The prefill phase takes the entire input prompt (potentially thousands of tokens) and processes it in a single parallel forward pass through the model. Because every prompt token is known up front, the QKV projections and the other linear layers run as large dense matrix multiplies over all positions at once, which makes prefill a compute-bound workload. The pass writes KV cache entries for every prompt token and produces the logits from which the first generated token is chosen. Prefill latency is typically the dominant contributor to time-to-first-token (TTFT).
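As a rough illustration of the definition above, the sketch below runs a prefill pass and one cached decode step for a single attention head in NumPy. It is a toy under stated assumptions, not how any particular inference engine implements prefill: the function names (`prefill`, `decode_step`), the single-head layout, and the random weights are all hypothetical, and real models add multiple heads, layers, MLP blocks, and positional encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill(x, Wq, Wk, Wv):
    """Process all T prompt tokens at once (hypothetical single-head sketch).

    x: (T, d) prompt embeddings. Returns per-position outputs and the KV cache.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # three (T, d) dense matmuls
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T, T) attention scores
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                   # (T, d)
    return out, (k, v)                          # (k, v) is the KV cache

def decode_step(x_new, cache, Wq, Wk, Wv):
    """Process one new token, reusing cached keys/values for earlier tokens."""
    k_cache, v_cache = cache
    q = x_new @ Wq                              # (1, d): only the new token
    k = np.concatenate([k_cache, x_new @ Wk])   # append the new key
    v = np.concatenate([v_cache, x_new @ Wv])   # append the new value
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (1, T+1)
    out = softmax(scores) @ v                   # (1, d)
    return out, (k, v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 5, 4
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    x = rng.standard_normal((T + 1, d))

    _, cache = prefill(x[:T], Wq, Wk, Wv)       # prompt of T tokens
    out_dec, cache = decode_step(x[T:], cache, Wq, Wk, Wv)
    print(out_dec.shape, cache[0].shape)
```

The shape difference is the point: prefill multiplies a (T, d) matrix by the weights, while each decode step multiplies a single row and mostly reads the cache, which is why the two phases stress the hardware differently.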

Related: Decode Phase, TTFT, Chunked Prefill

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
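To make the KV cache saving from GQA concrete, here is a small back-of-the-envelope calculation. The helper name `kv_cache_bytes` and the model shape used (32 layers, 32 query heads, head dimension 128, 4096-token context, 16-bit values) are illustrative assumptions, not figures for any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

GIB = 1024 ** 3

# Assumed shape: 32 layers, head_dim 128, 4096 tokens, fp16/bf16 values.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA (32 KV heads): {mha / GIB:.2f} GiB")  # 2.00 GiB
print(f"GQA (8 KV heads):  {gqa / GIB:.2f} GiB")  # 0.50 GiB
```

With these assumed numbers, sharing each K/V head across a group of four query heads cuts the cache to a quarter of the MHA size; the ratio is simply the number of query heads divided by the number of KV heads.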
