Prefill Phase
The initial forward pass that processes the full input prompt in parallel, producing the first output token and populating the KV cache.
Definition
The prefill phase takes the entire input prompt (potentially thousands of tokens) and processes it in a single parallel forward pass through the model. Because every prompt token is known up front, all positions are processed at once, so the QKV projections become large dense matrix multiplications, which makes prefill a compute-bound workload. The prefill pass generates the KV cache entries for all prompt tokens and produces the logits for the first generated token. Prefill latency is typically the dominant contributor to time-to-first-token (TTFT).
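To make the compute-bound claim concrete, the rough sketch below estimates arithmetic intensity (FLOPs per byte moved) for a single square projection. The function name `projection_intensity`, the model width of 4096, the 2048-token prompt, and 2-byte (fp16) elements are illustrative assumptions, and the cost model ignores caching, kernel fusion, and the attention computation itself.

```python
def projection_intensity(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte for one d_model x d_model projection (hypothetical cost model)."""
    # One multiply and one add per weight, per token.
    flops = 2 * n_tokens * d_model * d_model
    # The weight matrix is read once, however many tokens share it.
    weight_bytes = d_model * d_model * bytes_per_elem
    # Input activations are read and output activations are written.
    activation_bytes = 2 * n_tokens * d_model * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


# Assumed sizes: a 2048-token prompt versus a single token, model width 4096.
prefill_intensity = projection_intensity(n_tokens=2048, d_model=4096)
single_token_intensity = projection_intensity(n_tokens=1, d_model=4096)

print(f"prefill:      {prefill_intensity:.1f} FLOPs/byte")
print(f"single token: {single_token_intensity:.3f} FLOPs/byte")
```

Under these assumptions the prompt-sized multiply performs on the order of a thousand FLOPs per byte moved, while a single-token multiply performs about one, because the same weights are reused across every prompt token. That reuse is why prefill tends to be limited by arithmetic throughput rather than memory bandwidth.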
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
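The size of the KV cache that prefill populates can be estimated with simple arithmetic, which also shows how GQA shrinks it. The sketch below is a back-of-the-envelope calculation; the function name `kv_cache_bytes` and the configuration (32 layers, head dimension 128, 4096-token prompt, fp16 storage, 32 versus 8 K/V heads) are hypothetical values chosen for illustration, not taken from any specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for every layer and token."""
    # The leading factor of 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical configuration: same model shape, differing only in K/V head count.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA cache: {mha_bytes / 1024**3:.2f} GiB")
print(f"GQA cache: {gqa_bytes / 1024**3:.2f} GiB")
```

With these assumed numbers, cutting the K/V heads from 32 to 8 reduces the cache by a factor of four, since cache size scales linearly with the number of K/V heads.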