Architecture

Logits

Raw, unnormalised scores over the vocabulary produced by the model's final linear layer before softmax is applied.

Definition

Logits are the output of the language model head: a vector of length V (the vocabulary size, e.g., 128,256 for Llama 3) holding one unnormalised score per candidate next token. Each logit equals that token's log-probability up to an additive constant shared across the vocabulary, so applying softmax to the logits yields a proper probability distribution. In inference engines such as vLLM or SGLang, the logits may be post-processed with penalties, biases, or grammar masks (as in constrained decoding) before sampling occurs. Computing logits is a large matmul (hidden_dim × vocab) and, during decoding at small batch sizes, is often memory-bandwidth bound because the full weight matrix is read at every step.
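The post-processing described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the code path of vLLM, SGLang, or any other engine: the function names `softmax` and `process_logits`, the `bias` and `banned` parameters, and the four-token vocabulary are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def process_logits(logits, bias=None, banned=(), temperature=1.0):
    """Hypothetical helper: additive bias, hard mask, then temperature scaling.

    Assumes temperature > 0 and that at least one token stays unmasked.
    """
    out = list(logits)
    for token_id, delta in (bias or {}).items():
        out[token_id] += delta        # nudge a token up or down
    for token_id in banned:
        out[token_id] = float("-inf") # a masked token gets probability 0
    return [x / temperature for x in out]

# Toy vocabulary of 4 tokens (a real V would be e.g. 128,256).
logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(logits)
masked_probs = softmax(process_logits(logits, bias={1: 0.5}, banned=[0]))
```

Setting a logit to negative infinity is one common way to express a hard mask, since its exponential is exactly zero and the remaining tokens renormalise among themselves. Biases and penalties, by contrast, only shift scores and leave every token reachable.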

Sampling Temperature

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
