Skip to content
Architecture

Tokenization

The process of splitting text into discrete tokens that form the model's vocabulary, typically via byte-pair encoding (BPE) or similar algorithms.

Definition

Tokenization converts raw text into a sequence of integer token IDs. Most modern LLMs use Byte-Pair Encoding (BPE) or its variants (SentencePiece, tiktoken) to learn a vocabulary of 32K–100K+ subword units that efficiently cover natural-language text. The tokenization step itself is CPU-bound and usually off the critical path in batch inference, but the choice of tokenizer affects sequence lengths — which directly impacts memory, KV cache size, and latency.

More Architecture terms