Hardware

Memory Bandwidth

The rate at which data can be read from or written to GPU memory, measured in TB/s — the primary bottleneck during autoregressive LLM decoding.

Definition

Memory bandwidth is the number of bytes per second that the GPU can move between HBM and its compute units. During autoregressive decoding at small batch sizes, the GPU must stream all model weights from HBM once per generated token but performs very little arithmetic per byte read, which makes decoding a classic memory-bandwidth-bound workload. The A100 80GB SXM offers about 2.0 TB/s; the H100 SXM offers 3.35 TB/s. At batch size 1, doubling memory bandwidth roughly doubles decode throughput. Quantization helps in the same way, by shrinking the number of bytes that must be read per token, so both hardware selection and quantization directly affect cost-per-token.
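The relationship between bandwidth, model size, and decode speed can be sketched with a back-of-the-envelope calculation. The snippet below is a rough upper bound only: it assumes batch size 1, that every weight is read from HBM exactly once per token, and that KV-cache reads, activations, and kernel overheads are negligible. The 7-billion-parameter model and the function name `decode_tokens_per_second` are hypothetical choices for illustration; the bandwidth figures are the ones quoted above.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_s):
    """Rough upper bound on batch-size-1 decode throughput.

    Assumes all weights are streamed from HBM once per generated token
    and that nothing else (KV cache, activations) consumes bandwidth.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# Peak memory bandwidth in TB/s, as quoted in the definition above.
GPUS = {"A100 SXM": 2.0, "H100 SXM": 3.35}

# Bytes per parameter for two common weight formats.
FORMATS = {"FP16": 2.0, "INT4": 0.5}

if __name__ == "__main__":
    for gpu, bandwidth in GPUS.items():
        for fmt, bytes_per_param in FORMATS.items():
            # Hypothetical 7B-parameter model, chosen only for illustration.
            tps = decode_tokens_per_second(7, bytes_per_param, bandwidth)
            print(f"{gpu} {fmt}: ~{tps:.0f} tokens/s upper bound")
```

Under these assumptions, a 7B model in FP16 (14 GB of weights) tops out at roughly 143 tokens/s on the A100 and roughly 239 tokens/s on the H100, and moving to 4-bit weights raises both ceilings fourfold. Real systems land below these numbers, but the ratios show why bandwidth and bytes-per-parameter dominate decode cost.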

Related terms: HBM, Arithmetic Intensity, Roofline Model

More Hardware terms

HBM (High Bandwidth Memory)

3D-stacked DRAM technology used in datacenter GPUs, offering several times the memory bandwidth of GDDR at the cost of higher price and packaging complexity.

VRAM

Video RAM, the GPU's dedicated memory (on-package HBM on datacenter GPUs, separate from the GPU die itself), which holds model weights, KV cache, and activations during inference.

FLOPS

Floating-point operations per second — the peak compute throughput of a GPU, determining how fast compute-bound operations (like prefill) run.
