Embedding
Dense vector representation of a token in a high-dimensional space, learned during training to encode semantic and syntactic relationships.
Definition
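An embedding maps a discrete token ID to a continuous vector of dimension D (e.g., 4096 for Llama 3 8B). The embedding table has shape vocabulary size × D and is often one of the largest single parameter tensors in a model. Some models tie it to the language-model head (weight tying) so that one matrix serves both roles; others keep separate input and output matrices. At inference time, the input embedding lookup is a cheap row gather, but the final projection from the model's hidden states back to the vocabulary (the unembedding/lm_head operation) is a large matrix multiply. At small batch sizes it can contribute meaningfully to latency, because the full weight matrix must be read to produce logits for only a few tokens.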
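The lookup and the unembedding projection can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names embed, unembed, and lm_head_bytes are hypothetical, the sizes are deliberately tiny, and the tied-weights line is an assumption that does not hold for every model.

```python
import random

VOCAB_SIZE = 8   # toy value; real vocabularies are on the order of 10^5 entries
D = 4            # toy hidden size; the definition cites 4096

random.seed(0)
# Embedding table: one D-dimensional row per token ID.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Input embedding: a plain row lookup, no arithmetic."""
    return [embedding_table[t] for t in token_ids]

def unembed(hidden, lm_head):
    """Project one hidden state (length D) to logits (one per vocabulary row)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]

def lm_head_bytes(vocab_size, d_model, bytes_per_weight=2):
    """Weight bytes read by one unembedding pass (2 bytes assumes 16-bit weights)."""
    return vocab_size * d_model * bytes_per_weight

# Assumption: tied weights, so the lm_head reuses the embedding table.
lm_head = embedding_table

vectors = embed([3, 5])                 # two token IDs -> two D-dimensional vectors
logits = unembed(vectors[0], lm_head)   # one hidden state -> VOCAB_SIZE logits
```

With an assumed vocabulary of 128,256 entries and D = 4096, lm_head_bytes(128256, 4096) comes to roughly 1.05 GB of 16-bit weights, which is why the projection is noticeable when only a handful of tokens are decoded per step.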
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
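As a rough illustration of how sharing K/V heads shrinks the KV cache, the per-token cache size can be estimated as below. The function name kv_cache_bytes_per_token and the configuration values (32 layers, head dimension 128, 32 versus 8 KV heads, 16-bit values) are assumptions chosen for the example, not figures from this glossary.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token (hypothetical helper).

    The leading factor of 2 counts one key and one value vector
    per KV head per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# MHA: every query head has its own K/V head (assumed 32 heads).
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

# GQA: groups of 4 query heads share one K/V head (assumed 8 KV heads).
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

reduction = mha // gqa  # cache shrinks by the query-to-KV head ratio
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, here a factor of 4.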