Memory Bandwidth
The rate at which data can be read from or written to GPU memory, measured in TB/s — the primary bottleneck during autoregressive LLM decoding.
Definition
Memory bandwidth is the number of bytes per second that the GPU can move between HBM and its compute units. During autoregressive decoding at small batch sizes, the GPU must stream all model weights from HBM once per generated token but performs only a few arithmetic operations per byte loaded, which makes decoding a classic memory-bandwidth-bound workload. The A100 80GB SXM offers about 2.0 TB/s; the H100 SXM offers 3.35 TB/s. At batch size 1, doubling memory bandwidth roughly doubles decode throughput, and halving the bytes stored per weight through quantization has a similar effect, so hardware selection and quantization are both directly relevant to cost-per-token.
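The relationship above can be sketched as a back-of-the-envelope bound: if streaming the weights is the only cost, the decode rate at batch size 1 cannot exceed bandwidth divided by the size of the weights in bytes. The snippet below is a rough illustration, not a benchmark. The 7B-parameter model and the FP16/INT4 byte widths are assumptions chosen for the example, and the function name is hypothetical. Real throughput will be lower because of KV-cache reads, kernel launch overhead, and bandwidth that falls short of the peak figure.

```python
def decode_tokens_per_second(bandwidth_tb_s, n_params, bytes_per_param):
    """Upper bound on batch-size-1 decode rate, assuming every weight is
    read from HBM exactly once per token and nothing else costs time."""
    weight_bytes = n_params * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# Assumption for illustration only: a hypothetical 7B-parameter model.
N_PARAMS = 7e9

# Peak bandwidth figures taken from the definition above.
GPUS = [("A100 SXM", 2.0), ("H100 SXM", 3.35)]

# Assumed storage widths: 2 bytes per weight (FP16), 0.5 bytes (4-bit).
FORMATS = [("FP16", 2.0), ("INT4", 0.5)]

for gpu_name, bandwidth in GPUS:
    for fmt_name, bytes_per_param in FORMATS:
        rate = decode_tokens_per_second(bandwidth, N_PARAMS, bytes_per_param)
        print(f"{gpu_name} {fmt_name}: at most ~{rate:.0f} tokens/s")
```

Under these assumptions the bound scales linearly with bandwidth and inversely with bytes per weight, which is why a faster memory system and a narrower weight format both raise the ceiling on single-stream decode speed.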
Related
More Hardware terms
HBM (High Bandwidth Memory)
3D-stacked DRAM technology used in data-centre GPUs, offering several times the memory bandwidth of GDDR at a higher cost per gigabyte.
VRAM
Video RAM, the GPU's dedicated off-chip memory (HBM on data-centre GPUs), which holds model weights, KV cache, and activations during inference.
FLOPS
Floating-point operations per second — the peak compute throughput of a GPU, determining how fast compute-bound operations (like prefill) run.