Architecture

Tokenization

The process of splitting text into discrete tokens that form the model's vocabulary, typically via byte-pair encoding (BPE) or similar algorithms.

Definition

Tokenization converts raw text into a sequence of integer token IDs. Most modern LLMs use Byte-Pair Encoding (BPE) or its variants (SentencePiece, tiktoken) to learn a vocabulary of 32K–100K+ subword units that efficiently cover natural-language text. The tokenization step itself is CPU-bound and usually off the critical path in batch inference, but the choice of tokenizer affects sequence lengths — which directly impacts memory, KV cache size, and latency.

Logits Sampling

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.

Back to Glossary Start Reading — Chapter 0