VRAM
Video RAM — the GPU's dedicated on-chip memory (HBM on datacenter GPUs) holding model weights, KV cache, and activations during inference.
Definition
VRAM is the primary memory pool visible to GPU kernels. For LLM inference, VRAM is partitioned between model weights (static), the KV cache (dynamic, scales with batch size and sequence length), and activations (transient). An H100 SXM has 80 GB of HBM3 VRAM. Fitting a model in VRAM requires the product of parameter count × bytes-per-parameter to be less than available capacity after accounting for the runtime overhead. Exceeding VRAM causes OOM errors or requires model sharding across multiple GPUs.
Related
More Hardware terms
HBM (High Bandwidth Memory)
3D-stacked DRAM technology used in data-centre GPUs, offering memory bandwidth 5–10× higher than GDDR at the cost of smaller capacity.
Memory Bandwidth
The rate at which data can be read from or written to GPU memory, measured in TB/s — the primary bottleneck during autoregressive LLM decoding.
FLOPS
Floating-point operations per second — the peak compute throughput of a GPU, determining how fast compute-bound operations (like prefill) run.