Optimization

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM usage and increase throughput.

Definition

Quantization represents weights, and sometimes activations, using fewer bits than the FP32, FP16, or BF16 precision used in training. The most common targets are INT8 (8-bit integer), FP8 (8-bit floating point), and INT4 (4-bit integer). Lower-precision weights occupy less VRAM and load faster from HBM, which especially benefits decoding because that phase is typically memory-bandwidth-bound. The trade-off is potential accuracy loss; calibration-based and outlier-aware methods (GPTQ, AWQ, SmoothQuant) are used to maintain quality.
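To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not the method used by GPTQ, AWQ, or SmoothQuant; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

A single scale per tensor is the simplest choice; a large outlier stretches the scale and costs resolution for every other weight, which is the problem that outlier-aware methods are designed to address.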

Related: INT8, FP8, AWQ, GPTQ

More Optimization terms

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

AWQ (Activation-aware Weight Quantization)

Weight-only quantization method that protects the roughly 1% of weight channels most salient for output quality, identified from activation statistics, enabling accurate 4-bit inference.
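The memory savings these entries describe follow from simple arithmetic on bits per weight. The sketch below estimates weight storage only, for a hypothetical 7-billion-parameter model; it ignores quantization metadata (scales, zero points), activations, and the KV cache, so real footprints will be somewhat larger. The helper name `weight_memory_gb` is an assumption for illustration.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params, fmt):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gb(num_params, fmt):.1f} GB")
```

Under these assumptions, INT8 halves the FP16 weight footprint and INT4 quarters it.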
