
Memory Bandwidth

The rate at which data can be read from or written to GPU memory, usually quoted in TB/s on current data-center GPUs. It is typically the primary bottleneck during autoregressive LLM decoding at small batch sizes.

Definition

Memory bandwidth is the number of bytes per second that the GPU can move between its high-bandwidth memory (HBM) and its compute units. During autoregressive decoding at small batch sizes, the GPU must stream all model weights from HBM once per generated token while doing very little arithmetic per byte loaded, which makes decoding a classic memory-bandwidth-bound workload. The A100 80GB SXM offers about 2.0 TB/s; the H100 SXM offers 3.35 TB/s. At batch size 1, decode throughput scales roughly linearly with memory bandwidth, so doubling bandwidth roughly doubles tokens per second. Quantization helps for the same reason: fewer bytes per weight means less data to stream per token. Both hardware selection and quantization are therefore directly relevant to cost-per-token.
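
As a rough illustration of this scaling, the sketch below estimates an upper bound on batch-size-1 decode throughput by dividing memory bandwidth by the size of the model weights. The function name, the 7B-parameter model, and the bytes-per-parameter values are illustrative assumptions, not part of any library. The estimate ignores KV-cache reads, activation traffic, and kernel overheads, so measured throughput will be lower.

```python
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             bandwidth_tb_s: float) -> float:
    """Upper-bound tokens/s at batch size 1, assuming every weight is
    read from HBM exactly once per generated token and nothing else
    (KV cache, activations) consumes bandwidth."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical 7B-parameter model; bandwidth figures are the ones quoted above.
    gpus = {"A100 SXM": 2.0, "H100 SXM": 3.35}
    precisions = {"FP16": 2.0, "INT4": 0.5}  # bytes per parameter (assumed)

    for gpu_name, bandwidth in gpus.items():
        for precision, bytes_per_param in precisions.items():
            rate = decode_tokens_per_second(7, bytes_per_param, bandwidth)
            print(f"{gpu_name} {precision}: ~{rate:.0f} tokens/s (upper bound)")
```

Under these assumptions, a 7B model in FP16 (14 GB of weights) tops out at roughly 143 tokens/s on the A100 and 239 tokens/s on the H100. The ratio of the two matches the ratio of their bandwidths, and dropping to 4-bit weights raises the ceiling by about 4x on either GPU.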
