Skip to content
Hardware

Tensor Core

Specialised matrix-multiply-accumulate units in NVIDIA GPUs that execute fused D=A×B+C on tiles at up to 16× the throughput of CUDA cores.

Definition

Tensor Cores are NVIDIA's specialised matrix-multiply hardware, introduced in Volta (V100) and refined in Ampere (A100), Hopper (H100), and Blackwell (B200). They execute a fused D = A × B + C matrix operation on small tiles (e.g., 16×16 or 8×16×16) in a single clock cycle using reduced-precision inputs (FP16, BF16, FP8, INT8, INT4). The key to reaching peak FLOPS is keeping Tensor Cores fed with data; FlashAttention and efficient kernel tiling strategies are designed specifically to maximise Tensor Core utilisation.

More Hardware terms