
VRAM

Video RAM: the GPU's dedicated device memory (HBM on datacenter GPUs), which holds model weights, the KV cache, and activations during inference.

Definition

VRAM is the primary memory pool visible to GPU kernels. For LLM inference, it is divided among model weights (static), the KV cache (dynamic, growing with batch size and sequence length), and activations (transient). An H100 SXM, for example, has 80 GB of HBM3 VRAM. For a model to fit, its weight footprint (parameter count × bytes per parameter) plus the KV cache and activations must stay within the capacity left after runtime overhead. A 7-billion-parameter model stored in FP16 (2 bytes per parameter), for instance, needs about 14 GB for weights alone. When the requirement exceeds available VRAM, allocation fails with an out-of-memory (OOM) error unless the model is sharded across multiple GPUs.
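The sizing rule can be sketched as a rough back-of-the-envelope calculation. The snippet below is a minimal illustration, not a precise planner: the function names, the 10% overhead reserve, and the model shape used in the example (32 layers, 32 KV heads, head dimension 128) are assumptions chosen for illustration, and activations are folded into the overhead reserve rather than modeled separately.

```python
def weight_bytes(num_params: int, bytes_per_param: float) -> float:
    """Static footprint of the weights: parameter count x bytes per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int) -> int:
    """KV cache footprint; the factor 2 covers one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def fits_in_vram(weights: float, kv_cache: float, capacity_bytes: float,
                 overhead_fraction: float = 0.10) -> bool:
    """True if weights + KV cache fit in the capacity left after an assumed overhead reserve."""
    usable = capacity_bytes * (1.0 - overhead_fraction)
    return weights + kv_cache <= usable


# Hypothetical 7B-parameter model in FP16 on one 80 GB GPU (capacity taken as 80e9 bytes).
CAPACITY = 80e9
weights_7b = weight_bytes(7_000_000_000, 2)
kv_7b = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=1, bytes_per_elem=2)
print(f"weights: {weights_7b / 1e9:.1f} GB, KV cache: {kv_7b / 1e9:.2f} GB")
print("7B fits: ", fits_in_vram(weights_7b, kv_7b, CAPACITY))

# A 70B-parameter model in FP16 exceeds a single 80 GB GPU on weights alone.
weights_70b = weight_bytes(70_000_000_000, 2)
print("70B fits:", fits_in_vram(weights_70b, 0, CAPACITY))
```

Because the KV cache term is linear in both batch size and sequence length, doubling either doubles that part of the footprint, while the weight term stays fixed.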
