Tokenization
The process of splitting text into discrete tokens drawn from the model's vocabulary, typically via byte-pair encoding (BPE) or a similar subword algorithm.
Definition
Tokenization converts raw text into a sequence of integer token IDs. Most modern LLMs use Byte-Pair Encoding (BPE) or a related subword algorithm, implemented in libraries such as SentencePiece and tiktoken, with a learned vocabulary of roughly 32K to 100K+ subword units that covers natural-language text efficiently. Tokenization itself is CPU-bound and usually off the critical path in batch inference, but the choice of tokenizer determines how many tokens a given text becomes, and sequence length directly affects memory use, KV cache size, and latency.
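As an illustration of how a BPE vocabulary might be learned and applied, here is a minimal character-level sketch. It is a toy, not the implementation used by SentencePiece or tiktoken: production tokenizers typically work on bytes, pre-split text with rules, and handle unknown input. The names train_bpe, merge_pair, and encode are hypothetical helpers introduced for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words.

    Returns (merges, vocab): the ordered merge rules and a symbol -> ID map.
    """
    # Start from single characters; count identical words once with a frequency.
    words = Counter(tuple(word) for word in corpus)
    vocab = {ch: i for i, ch in enumerate(sorted({c for w in corpus for c in w}))}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab[best[0] + best[1]] = len(vocab)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges, vocab


def encode(word, merges, vocab):
    """Convert a word to token IDs by applying the merges in the order they were learned.

    Raises KeyError for characters that never appeared in the training corpus.
    """
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [vocab[s] for s in symbols]


merges, vocab = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)                          # [('o', 'w'), ('l', 'ow')]
print(encode("low", merges, vocab))    # [8]
print(encode("lower", merges, vocab))  # [8, 0, 3]
```

With only two merges, "low" shrinks from three symbols to one token. A larger merge budget shortens sequences further, at the cost of a larger vocabulary.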
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
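To make the link between sequence length, KV cache size, and GQA concrete, the following sketch estimates cache size with the usual back-of-the-envelope formula: 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The model shape below (32 layers, 32 query heads, head dimension 128, fp16) is an assumed example, not a specific model, and kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (keys and values together).

    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 1024 ** 3

# Assumed example shape: 32 layers, head_dim 128, 4096-token sequence.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(mha / GIB)  # 2.0  (MHA: one K/V head per query head)
print(gqa / GIB)  # 0.5  (GQA: 8 K/V heads shared by 32 query heads)
```

Because the estimate is linear in seq_len, a tokenizer that produces 20% fewer tokens for the same text reduces the cache by the same 20%.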