Optimization

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory relative to FP16, typically with a small accuracy loss.

Definition

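INT8 quantization stores each weight as an 8-bit signed integer and keeps a floating-point scale factor that is shared per tensor or per channel. Compared to FP16, this halves the weight memory footprint, so memory-bound workloads move half as many bytes per weight and see up to twice the effective memory bandwidth. Hardware that supports INT8 tensor operations (e.g., NVIDIA A100/H100 via cuBLAS) can also execute the matrix multiplications directly in integer arithmetic when both operands are quantized. Quantizing activations as well as weights (W8A8) is trickier because activations contain outliers that stretch the quantization range; SmoothQuant migrates this quantization difficulty from activations to weights through a mathematically equivalent per-channel rescaling, making W8A8 feasible at scale.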
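The weight side of this definition can be illustrated with a minimal sketch of symmetric per-channel quantization. It assumes each row of the weight matrix is one output channel and uses plain Python lists for clarity; the function names `quantize_per_channel` and `dequantize` and the sample weights are hypothetical, and a real implementation would rely on a tensor library and fused kernels.

```python
def quantize_per_channel(weights):
    """Symmetric INT8 quantization with one scale per row (assumed output channel)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(x) for x in row)
        # Map the largest magnitude in the row to 127; fall back to 1.0 for all-zero rows.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from integers and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Illustrative weights: the second row has a much larger range than the first.
weights = [
    [0.02, -0.50, 0.31, 0.07],
    [4.00, -1.25, 0.00, 2.50],
]
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
```

Because each row gets its own scale, the small-valued first row is not forced onto the coarse grid that the second row needs. With a single per-tensor scale, both rows would share the step size set by the largest value in the matrix. In this sketch the rounding error of each restored weight is bounded by half of its row's scale.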
