Hardware

VRAM

Video RAM: the GPU's dedicated memory, located off the compute die (HBM stacks on the GPU package for datacenter GPUs), which holds model weights, the KV cache, and activations during inference.

Definition

VRAM is the primary memory pool visible to GPU kernels. For LLM inference, VRAM is partitioned between model weights (static), the KV cache (dynamic, scales with batch size and sequence length), and activations (transient). An H100 SXM has 80 GB of HBM3 VRAM. For a model to fit, the weights (parameter count × bytes per parameter) plus the KV cache, activations, and runtime overhead must stay below the available capacity. Exceeding VRAM causes out-of-memory (OOM) errors unless the model is sharded across multiple GPUs.
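The fit condition above can be sketched as a back-of-the-envelope calculation. This is a rough estimate, not a profiler: the model shape (a hypothetical 7B-parameter model with 32 layers, 32 KV heads, head dimension 128), the 2 GB overhead figure, and the decimal definition of GB are all illustrative assumptions, and activations are folded into the overhead term.

```python
GB = 1e9  # assumption: decimal gigabytes


def weights_bytes(num_params: float, bytes_per_param: float) -> float:
    """Static footprint of the model weights."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Dynamic footprint of the KV cache.

    The leading factor 2 counts one key tensor and one value tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def fits_in_vram(capacity: float, weights: float, kv_cache: float,
                 overhead: float) -> bool:
    """True if weights + KV cache + overhead stay within capacity."""
    return weights + kv_cache + overhead <= capacity


# Hypothetical 7B-parameter model stored in 16-bit precision (2 bytes/param).
w = weights_bytes(7e9, 2)
kv = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                    seq_len=4096, batch_size=8)
overhead = 2 * GB  # illustrative guess for runtime overhead and activations

print(f"weights:  {w / GB:.1f} GB")
print(f"KV cache: {kv / GB:.1f} GB")
print("fits in 80 GB:", fits_in_vram(80 * GB, w, kv, overhead))
print("fits in 24 GB:", fits_in_vram(24 * GB, w, kv, overhead))
```

With these assumed numbers the KV cache is larger than the weights, which illustrates why batch size and sequence length, not just parameter count, decide whether a workload fits.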


More Hardware terms

HBM (High Bandwidth Memory)

3D-stacked DRAM technology used in datacenter GPUs, offering several times the memory bandwidth of GDDR at the cost of higher price and packaging complexity.

Memory Bandwidth

The rate at which data can be read from or written to GPU memory, measured in GB/s or TB/s. It is typically the primary bottleneck during autoregressive LLM decoding, because each decode step must stream the model weights from memory.
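To see why bandwidth bounds decoding, here is a minimal sketch of the usual upper-bound estimate. It assumes every decode step reads all weights (and optionally the KV cache) from VRAM exactly once and ignores compute time and kernel overhead. The 3 TB/s bandwidth and 14 GB weight size are illustrative placeholders, not specifications of any particular GPU or model.

```python
def max_decode_steps_per_s(bandwidth_bytes_per_s: float,
                           weight_bytes: float,
                           kv_bytes_read: float = 0.0) -> float:
    """Upper bound on decode steps per second when memory-bound.

    Assumes each step streams the weights and the KV cache from VRAM once.
    """
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes_read)


# Illustrative values: 3 TB/s of bandwidth, 14 GB of 16-bit weights.
steps = max_decode_steps_per_s(3e12, 14e9)
print(f"at most about {steps:.0f} decode steps per second")
```

Under this model, one step produces one token per sequence in the batch, so batching raises total tokens per second until the KV cache reads or compute become the limit.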

FLOPS

Floating-point operations per second — the peak compute throughput of a GPU, determining how fast compute-bound operations (like prefill) run.
