Optimization

SmoothQuant

W8A8 quantization technique that migrates quantization difficulty from activations to weights via a mathematically equivalent per-channel scaling.

Definition

SmoothQuant (Xiao et al., 2022) addresses the difficulty of quantizing activations, whose large outliers, concentrated in a few channels, undermine per-tensor INT8 quantization. It divides each activation channel by a smoothing factor and multiplies the corresponding weight input channel by the same factor. This transformation is mathematically equivalent, so the layer output is unchanged, but it shrinks the dynamic range of the activations and makes both activations and weights quantization-friendly. The result is a W8A8 model with near-FP16 accuracy that enables hardware-efficient inference on GPUs with INT8 Tensor Core support.
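
The scaling can be illustrated with a minimal pure-Python sketch. It assumes the layer computes Y = XW with X shaped (tokens, input channels) and W shaped (input channels, output channels), and uses the per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. The helper names and the toy matrices are hypothetical and chosen only for illustration; this is not the reference implementation.

```python
# Illustrative sketch only: toy data and hypothetical helper names.

def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def abs_max(M):
    """Largest absolute value in a matrix (what per-tensor scaling depends on)."""
    return max(abs(v) for row in M for v in row)

def smoothing_factors(X, W, alpha=0.5, eps=1e-8):
    """One factor per input channel j: max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    factors = []
    for j, w_row in enumerate(W):
        act_max = max(max(abs(x_row[j]) for x_row in X), eps)
        w_max = max(max(abs(w) for w in w_row), eps)
        factors.append(act_max ** alpha / w_max ** (1 - alpha))
    return factors

def smooth(X, W, alpha=0.5):
    """Return (X / s, s * W); their product equals X @ W up to rounding."""
    s = smoothing_factors(X, W, alpha)
    X_s = [[x / s_j for x, s_j in zip(row, s)] for row in X]
    W_s = [[w * s_j for w in row] for row, s_j in zip(W, s)]
    return X_s, W_s

# Toy activations (3 tokens x 3 channels); channel 1 carries the outliers.
X = [[0.5, -40.0, 0.2],
     [-0.3, 60.0, 0.1],
     [0.4, -55.0, -0.6]]
# Toy weights (3 input channels x 2 output channels).
W = [[0.2, -0.5],
     [0.1, 0.3],
     [-0.4, 0.6]]

X_s, W_s = smooth(X, W, alpha=0.5)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)
max_diff = max(abs(a - b) for ra, rb in zip(Y, Y_s) for a, b in zip(ra, rb))

print("output difference after smoothing:", max_diff)
print("activation abs-max before/after:", abs_max(X), abs_max(X_s))
print("weight abs-max before/after:", abs_max(W), abs_max(W_s))
```

In this toy case the activation range shrinks while the weight range grows, which is the sense in which quantization difficulty is moved from activations to weights. In practice the activation maxima would be estimated from calibration data rather than from a single batch.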

INT8 Quantization Chapter 5: Techniques

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
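
As a rough illustration, symmetric per-tensor INT8 quantization can be sketched as below. The function names and the sample values are hypothetical, and real toolchains typically use per-channel or per-group scales instead of one scale for the whole tensor.

```python
# Illustrative sketch only: hypothetical names and made-up weights.

def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-amax, amax] onto [-127, 127]."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate real values from the integer codes."""
    return [qi * scale for qi in q]

weights = [0.82, -1.27, 0.03, 0.45, -0.61, 1.10]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

fp16_bytes = 2 * len(weights)  # 2 bytes per FP16 value
int8_bytes = 1 * len(q)        # 1 byte per INT8 value, plus one shared scale

print("codes:", q)
print("scale:", scale, "max reconstruction error:", max_err)
print("bytes FP16 vs INT8:", fp16_bytes, int8_bytes)
```

The rounding error per value is at most half a quantization step, which is why a single large outlier (a large amax) degrades the precision of every other value in the tensor.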

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.
