Model Weights
The trained parameter tensors of an LLM (embeddings, attention projections, MLP layers) loaded into VRAM at startup and kept resident throughout serving.
Definition
Model weights are the learned parameter values resulting from pre-training and fine-tuning. For inference, all weight tensors are loaded into VRAM at server startup and remain resident throughout the serving lifetime of the model. A 7B-parameter model in BF16 requires approximately 14 GB of VRAM for weights alone (7B × 2 bytes); a 70B model needs ~140 GB. Weight memory is static and predictable, unlike the dynamic KV cache. Quantization reduces bytes-per-parameter (e.g., INT4 reduces 2 bytes to 0.5 bytes), directly cutting weight memory.
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.