Architecture

Beam Search

Deterministic decoding that keeps the B highest-scoring partial sequences at each step; common in machine translation but rarely used in modern LLM chat.

Definition

Beam search maintains B candidate sequences (beams) at each step. Every beam is extended with its most probable next tokens, the extensions are scored by cumulative log-probability, and only the B highest-scoring sequences are kept. With B = 1 it reduces to greedy decoding. Larger B usually finds a higher-probability sequence than greedy decoding, though it remains a heuristic and does not guarantee the globally optimal one. The cost is that B beams must be scored at every step, typically as one batched forward pass, so compute and KV cache memory grow roughly B-fold. Beam search is standard in machine translation and summarization, where outputs are short and well defined; for open-ended generation, sampling methods tend to produce more diverse and natural-sounding text.
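The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production decoder: `beam_search`, `toy_model`, and the probability table are hypothetical names and values invented for the example, and a dictionary lookup stands in for a real model's forward pass.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int,
    max_steps: int,
    eos: str = "<eos>",
) -> List[Tuple[Sequence, float]]:
    """Return up to `beam_width` (sequence, cumulative log-prob) pairs, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                # Finished beams are carried forward unchanged.
                candidates.append((seq, score))
                continue
            # Extend the beam with every next token the model proposes.
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + log_p))
        # Keep only the B highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical stand-in for a language model (made-up probabilities)."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


# B = 1 is greedy decoding: it commits to "a" (0.6) and ends with probability 0.30.
greedy = beam_search(toy_model, beam_width=1, max_steps=3)
# B = 2 keeps "b" alive and finds ("b", "x", "<eos>") with probability 0.36.
wide = beam_search(toy_model, beam_width=2, max_steps=3)
print(greedy[0], wide[0])
```

In this toy table the locally best first token leads to a worse overall sequence, which is the case where a wider beam pays off. A real decoder would replace the dictionary lookup with a batched model call over all live beams.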


More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
