Hardware

Tensor Core

Specialised matrix-multiply-accumulate units in NVIDIA GPUs that execute a fused D = A × B + C on small matrix tiles at up to 16× the throughput of CUDA cores.

Definition

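Tensor Cores are NVIDIA's specialised matrix-multiply hardware, introduced in Volta (V100) and refined in Ampere (A100), Hopper (H100), and Blackwell (B200). They execute a fused D = A × B + C matrix operation on small tiles (for example, 16×16×16 fragments in CUDA's WMMA API) using reduced-precision inputs (FP16, BF16, FP8, INT8, or INT4, depending on the GPU generation), typically accumulating in a wider format such as FP32. A tile operation is issued at warp level and can span several clock cycles; on Volta, each Tensor Core performs one 4×4×4 multiply-accumulate per clock. The key to reaching peak FLOPS is keeping Tensor Cores fed with data: efficient kernel tiling strategies, and IO-aware algorithms such as FlashAttention that cut traffic to GPU memory, help maximise Tensor Core utilisation.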
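The tile-level operation can be illustrated with a minimal pure-Python sketch. It only emulates the numerics (FP16 inputs, FP32 accumulation, 16×16×16 tiles) and says nothing about how the hardware schedules the work. The function names `to_fp16`, `to_fp32`, `mma_tile`, and `tiled_matmul` are hypothetical, chosen for this example, and rounding after every addition is an assumption that real Tensor Cores need not match exactly.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_tile(a, b, c):
    """Emulate one tile operation D = A x B + C.

    Inputs a and b are rounded to FP16; the running sum is rounded to FP32
    after each addition (an assumption about accumulation order).
    """
    m, k, n = len(a), len(b), len(b[0])
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for p in range(k):
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d


def tiled_matmul(a, b, tile=16):
    """Multiply a (M x K) by b (K x N) by looping over tile x tile blocks."""
    m, k, n = len(a), len(b), len(b[0])
    if m % tile or k % tile or n % tile:
        raise ValueError("all dimensions must be multiples of the tile size")
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            # The accumulator tile plays the role of C, then D, across the K loop.
            acc = [[0.0] * tile for _ in range(tile)]
            for k0 in range(0, k, tile):
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                acc = mma_tile(a_tile, b_tile, acc)
            for i in range(tile):
                out[i0 + i][j0:j0 + tile] = acc[i]
    return out
```

Keeping the accumulator tile live across the whole K loop mirrors why tiling matters: each output tile is written once, while the A and B tiles stream through, which is the data-feeding pattern that real kernels try to sustain.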

Related terms: FLOPS, FlashAttention, FP8

More Hardware terms

HBM (High Bandwidth Memory)

3D-stacked DRAM technology used in data-centre GPUs, offering several times the memory bandwidth of GDDR at higher cost and packaging complexity.

VRAM

Video RAM: the GPU's dedicated device memory (HBM on data-centre GPUs), which holds model weights, KV cache, and activations during inference.

Memory Bandwidth

The rate at which data can be read from or written to GPU memory, measured in TB/s — the primary bottleneck during autoregressive LLM decoding.
