Optimization

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

Definition

INT8 quantization represents each weight as an 8-bit signed integer together with a floating-point scale factor shared per tensor or per channel. Compared to FP16, this halves the weight memory footprint and the bytes read per weight, which roughly doubles effective memory bandwidth. Running the matrix multiplications themselves in integer arithmetic requires hardware that supports INT8 tensor operations (e.g., NVIDIA A100/H100 via cuBLAS). Quantizing activations as well as weights (W8A8) is trickier because activations contain outliers; SmoothQuant migrates the quantization difficulty from activations to weights to make W8A8 feasible at scale.
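As an illustration of the definition above, the following NumPy sketch shows symmetric per-channel INT8 weight quantization and a SmoothQuant-style rescaling step. It is a minimal sketch under stated assumptions, not a production kernel: the function names (`quantize_per_channel`, `dequantize`, `smooth`), the `(out_features, in_features)` weight layout, the symmetric [-127, 127] range, and the smoothing strength `alpha=0.5` are all choices made here for illustration.

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows (assumed fallback scale of 1.0).
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float64) * scale

def smooth(x, w, alpha=0.5):
    """Shift outlier magnitude from activations to weights.

    x: (tokens, in_features), w: (out_features, in_features).
    Dividing x and multiplying w by the same per-input-channel
    factor leaves the product x @ w.T unchanged.
    """
    act_max = np.abs(x).max(axis=0)
    w_max = np.abs(w).max(axis=0)
    s = act_max**alpha / w_max**(1.0 - alpha)
    return x / s, w * s[None, :]

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))
x = rng.standard_normal((16, 8))
x[:, 3] *= 50.0  # simulate one outlier activation channel

# Weight-only INT8: half the bytes of an FP16 copy, bounded rounding error.
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)
fp16_bytes = w.astype(np.float16).nbytes
int8_bytes = q.nbytes  # scale factors add a small per-channel overhead

# Smoothing before W8A8: same output, smaller activation range.
x_s, w_s = smooth(x, w)
```

Each reconstructed weight differs from the original by at most half of its channel's scale, which is why per-channel scales usually lose less accuracy than a single per-tensor scale when channel magnitudes differ. After smoothing, the outlier channel's range is shared between `x_s` and `w_s`, so both are easier to quantize to 8 bits.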

Related: FP8, SmoothQuant, Chapter 5: Techniques

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

AWQ (Activation-aware Weight Quantization)

Weight-only quantization method that protects the roughly 1% of weight channels most salient for output quality, identified from activation statistics, enabling accurate 4-bit inference.
