Beam Search
Deterministic decoding that keeps the B highest-scoring partial sequences at each step; common in translation but rarely used in modern LLM chat.
Definition
Beam search maintains B candidate sequences (beams) at each step. It expands every beam with its candidate next tokens, scores each extension by cumulative log-probability, and keeps only the B highest-scoring sequences. With B = 1 it reduces to greedy decoding; with larger B it approximates the globally most probable sequence more closely, though it is still a heuristic and does not guarantee the optimum. The cost is that every step must score B sequences instead of one (usually as a single batched forward pass), so compute and KV cache memory grow roughly in proportion to B. Beam search is standard in machine translation and summarization, where outputs are short and well defined, but for open-ended generation, sampling methods tend to produce more diverse and natural-sounding text.
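The expand-score-prune loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: the `beam_search` function, the `next_log_probs` callback, and the `TOY` probability table are all hypothetical names invented for this example, and a real model would return log-probabilities over its full vocabulary rather than a small dictionary.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int,
    max_steps: int,
    eos: str = "<eos>",
) -> List[Tuple[Sequence, float]]:
    """Return up to beam_width (sequence, cumulative log-prob) pairs, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                # Finished beams are carried forward unchanged.
                candidates.append((seq, score))
                continue
            # Expand this beam with every candidate next token.
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + log_p))
        # Prune: keep only the beam_width highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams


# Hypothetical toy model: the locally best first token ("a") leads to a
# worse overall sequence than the second-best first token ("b").
TOY: Dict[Sequence, Dict[str, float]] = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(seq: Sequence) -> Dict[str, float]:
    # Any prefix not in the table ends the sequence with probability 1.
    probs = TOY.get(seq, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy = beam_search(toy_next_log_probs, beam_width=1, max_steps=3)
wide = beam_search(toy_next_log_probs, beam_width=2, max_steps=3)
```

In this toy setup, a width of 1 commits to "a" and ends with total probability 0.6 × 0.5 = 0.30, while a width of 2 keeps "b" alive long enough to find "b z" at 0.4 × 0.9 = 0.36. Real implementations usually add length normalization as well, since raw cumulative log-probability favors shorter sequences; that is omitted here for brevity.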
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.