Architecture

Model Weights

The trained parameter tensors of an LLM (embeddings, attention projections, MLP layers) loaded into VRAM at startup and kept resident throughout serving.

Definition

Model weights are the learned parameter values produced by pre-training and fine-tuning. For inference, all weight tensors are loaded into VRAM at server startup and remain resident for the serving lifetime of the model. A 7B-parameter model in BF16 requires approximately 14 GB of VRAM for weights alone (7B × 2 bytes); a 70B model needs ~140 GB. Weight memory is static and predictable, unlike the KV cache, which grows and shrinks with the number and length of in-flight sequences. Quantization reduces bytes per parameter (e.g., INT4 stores 0.5 bytes per parameter instead of BF16's 2), cutting weight memory proportionally: the same 7B model drops to roughly 3.5 GB, before the small overhead of quantization scales.
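The arithmetic above can be sketched as a small estimator. This is a minimal illustration, not a sizing tool: the function name `weight_memory_gb` and the `BYTES_PER_PARAM` table are hypothetical, gigabytes are taken as decimal (1 GB = 1e9 bytes), and quantization metadata such as scales and zero points is ignored.

```python
# Bytes per parameter for a few common storage formats.
# Hypothetical lookup table for illustration; real quantized checkpoints
# also carry per-group scales, which this estimate ignores.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Estimate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("70B", 70e9)]:
        for dtype in ("bf16", "int4"):
            print(f"{name} {dtype}: {weight_memory_gb(params, dtype):.1f} GB")
```

Under these assumptions the estimate reproduces the figures in the definition: about 14 GB and 140 GB in BF16, and about 3.5 GB and 35 GB in INT4. Actual usage will be somewhat higher once framework overhead and quantization metadata are included.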

Related terms: VRAM, Quantization

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
