Architecture

Sampling

Stochastic token selection from the model's output probability distribution, as opposed to greedy (argmax) decoding.

Definition

Sampling selects the next token by drawing at random from the probability distribution obtained by applying softmax to the logits. Common variants include top-k sampling, which restricts the distribution to the k most probable tokens, and nucleus (top-p) sampling, which restricts it to the smallest set of tokens whose cumulative probability reaches or exceeds p; in both cases the remaining probabilities are renormalized before drawing. Sampling introduces controlled randomness that tends to improve output diversity and reduce repetition compared with greedy decoding. Dividing the logits by a temperature before the softmax tunes the sharpness of the distribution: values below 1 concentrate probability on the most likely tokens, and values above 1 flatten it.
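
The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample_token`) and the example logits are assumptions made up for this sketch, and real decoders typically operate on batched tensors.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature before softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k_filter(probs, k):
    # Keep the k most probable tokens, zero out the rest, renormalize.
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of most-probable tokens whose cumulative
    # probability reaches or exceeds p, then renormalize.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    n_keep = min(int(np.searchsorted(cum, p)) + 1, len(probs))
    keep = order[:n_keep]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def sample_token(logits, temperature=1.0, k=None, p=None, rng=None):
    # Hypothetical helper: temperature, then optional top-k / top-p, then draw.
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(logits, temperature)
    if k is not None:
        probs = top_k_filter(probs, k)
    if p is not None:
        probs = top_p_filter(probs, p)
    return int(rng.choice(len(probs), p=probs))

# Illustrative logits for a 4-token vocabulary (made-up values).
example_logits = [2.0, 1.0, 0.5, -1.0]
token_id = sample_token(example_logits, temperature=0.8, k=3, p=0.9)
```

Setting `k=1` reduces the sketch to greedy (argmax) decoding, which is one way to see that greedy decoding is the limiting case of restricted sampling.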

Related terms: Temperature, Beam Search, Logits

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
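
As a rough illustration of why sharing K/V heads shrinks the KV cache, the cache size can be estimated from the number of K/V heads. The function name `kv_cache_bytes` and the model dimensions below are hypothetical values chosen for the arithmetic, not figures for any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    # The leading factor of 2 accounts for storing both keys and values.
    # bytes_per_elem=2 assumes 16-bit storage (an assumption of this sketch).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096 tokens.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
# Same model with GQA: the 32 query heads share 8 K/V heads.
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

ratio = mha_bytes // gqa_bytes  # cache shrinks by the query-to-KV head ratio
```

Under these assumptions the cache scales linearly with the number of K/V heads, so grouping 32 query heads onto 8 K/V heads cuts the cache to a quarter of the MHA size.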
