Architecture

Embedding

Dense vector representation of a token in a high-dimensional space, learned during training to encode semantic and syntactic relationships.

Definition

An embedding maps a discrete token ID to a continuous vector of dimension D (e.g., 4096 for Llama 3 8B). The embedding table has shape V × D, where V is the vocabulary size, which typically makes it one of the largest single parameter tensors in a model. Some models tie it to the language-model head so that the two share weights; others keep a separate output matrix. At inference time, the input embedding lookup is trivial (a row gather), but the final projection from the model's hidden states back to the vocabulary (the unembedding/lm_head operation) is a large matrix multiply that contributes meaningfully to latency at small batch sizes.
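The contrast between the cheap lookup and the expensive projection can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the sizes `VOCAB_SIZE` and `D_MODEL` are toy values, the helper names `embed` and `unembed` are hypothetical, the weights are assumed to be tied, and the Transformer layers that would normally sit between the two steps are omitted.

```python
import numpy as np

# Toy sizes chosen for illustration; real models use far larger values.
VOCAB_SIZE = 1000
D_MODEL = 64

rng = np.random.default_rng(0)

# Embedding table: one row of length D_MODEL per token ID.
embedding_table = rng.standard_normal((VOCAB_SIZE, D_MODEL)).astype(np.float32)

# Unembedding (lm_head) weights. Assumption: tied to the embedding table here;
# models with untied weights keep a separate (VOCAB_SIZE, D_MODEL) matrix.
lm_head = embedding_table


def embed(token_ids: np.ndarray) -> np.ndarray:
    """Input embedding: a row gather, no arithmetic."""
    return embedding_table[token_ids]


def unembed(hidden: np.ndarray) -> np.ndarray:
    """Project hidden states to vocabulary logits with a matrix multiply."""
    return hidden @ lm_head.T


token_ids = np.array([3, 17, 42])
hidden = embed(token_ids)   # shape (3, D_MODEL); Transformer layers omitted
logits = unembed(hidden)    # shape (3, VOCAB_SIZE)

# Multiply-accumulate count of the projection for one token position.
macs_per_token = D_MODEL * VOCAB_SIZE
```

The lookup touches only three rows of the table, while the projection reads every row of `lm_head` for each position, which is why it tends to be bound by memory bandwidth when the batch is small.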

Related terms: Tokenization, Logits

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
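To make the KV cache saving from GQA concrete, the sketch below estimates cache size from the number of K/V heads. The function name `kv_cache_bytes` and the configuration values (32 layers, head dimension 128, 8192-token context, 2 bytes per value) are hypothetical and chosen only for illustration; they are not taken from a specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumption: 16-bit values (fp16/bf16)
) -> int:
    """Estimate KV cache size; the leading 2 covers keys plus values."""
    return (
        2 * n_layers * n_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Hypothetical configuration: MHA with 32 K/V heads vs. GQA with 8 K/V heads.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(f"MHA: {mha_bytes / 2**30:.1f} GiB")
print(f"GQA: {gqa_bytes / 2**30:.1f} GiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks in direct proportion to the number of K/V heads, so going from 32 to 8 heads cuts it by a factor of four.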
