Architecture

Model Weights

The trained parameter tensors of an LLM (embeddings, attention projections, MLP layers) loaded into VRAM at startup and kept resident throughout serving.

Definition

Model weights are the learned parameter values produced by pre-training and any subsequent fine-tuning. For GPU inference, the weight tensors are typically loaded into VRAM at server startup and remain resident for the serving lifetime of the model. A 7B-parameter model in BF16 requires approximately 14 GB of VRAM for weights alone (7B parameters × 2 bytes each); a 70B model needs ~140 GB. Weight memory is static and predictable, unlike the KV cache, which grows and shrinks with the number and length of in-flight sequences. Quantization reduces bytes per parameter (e.g., INT4 takes each weight from 2 bytes to 0.5 bytes) and cuts weight memory proportionally, to roughly 3.5 GB for the 7B model before any overhead for quantization scales.
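The arithmetic above can be sketched as a small helper. This is a minimal illustration, not part of any serving framework: the function name `weight_memory_gb` and the `BYTES_PER_PARAM` table are hypothetical. It assumes decimal gigabytes (1 GB = 10^9 bytes) and ignores quantization metadata such as scales and zero points, so real quantized checkpoints will be somewhat larger.

```python
# Approximate bytes used to store one parameter in each format.
# Assumption: quantized formats are counted at their nominal width only,
# with no allowance for per-group scales or zero points.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Estimate VRAM needed for weights alone, in decimal GB (1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("70B", 70e9)]:
        for dtype in ("bf16", "int8", "int4"):
            gb = weight_memory_gb(params, dtype)
            print(f"{name} in {dtype}: ~{gb:.1f} GB")
```

The estimate covers weights only. The KV cache, activations, and framework overhead come on top of it and vary with load.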
