Optimization

AWQ (Activation-aware Weight Quantization)

Weight-only quantization method that protects the roughly 1% of weight channels most salient for output quality, enabling accurate 4-bit inference.

Definition

AWQ (Lin et al., 2023) is a post-training, weight-only INT4 quantization method. It identifies the most salient weight channels from activation magnitudes measured on a small calibration set, then scales those channels up before quantizing (with the matching activations scaled down by the same factor), which protects them from precision loss without keeping any weights in higher precision. Because only weights, not activations, are quantized, nothing has to be quantized at runtime for each token, which keeps AWQ simple to deploy. AWQ-quantized models run on consumer GPUs via the AutoAWQ library or serving engines such as vLLM, and are widely distributed on Hugging Face.
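The scaling idea can be illustrated with a small pure-Python sketch. This is a simplified toy, not the AutoAWQ implementation: it assumes symmetric round-to-nearest quantization with one quantization group per output row, derives per-channel scales as the mean activation magnitude raised to a power alpha, and picks alpha by grid search on output error. The helper names (`quantize_rtn`, `output_error`, `search_awq_scales`) and the synthetic data are hypothetical.

```python
import random

def quantize_rtn(row, n_bits=4):
    """Symmetric round-to-nearest quantization of one weight group (returned dequantized)."""
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    delta = max_abs / qmax                            # quantization step size
    return [max(-qmax - 1, min(qmax, round(w / delta))) * delta for w in row]

def output_error(weights, calib, scales, n_bits=4):
    """Mean squared output error of Q(W * s) applied to (x / s), vs. full-precision W x."""
    scaled = [[w * s for w, s in zip(row, scales)] for row in weights]
    dequant = [quantize_rtn(row, n_bits) for row in scaled]
    total, count = 0.0, 0
    for x in calib:
        x_scaled = [xi / s for xi, s in zip(x, scales)]
        for row, qrow in zip(weights, dequant):
            y_ref = sum(w * xi for w, xi in zip(row, x))
            y_q = sum(w * xi for w, xi in zip(qrow, x_scaled))
            total += (y_ref - y_q) ** 2
            count += 1
    return total / count

def search_awq_scales(weights, calib, n_grid=20, n_bits=4):
    """Grid-search alpha in [0, 1]; alpha = 0 gives s = 1, i.e. plain round-to-nearest."""
    n_in = len(weights[0])
    act_mag = [sum(abs(x[j]) for x in calib) / len(calib) for j in range(n_in)]
    best = (float("inf"), 0.0, [1.0] * n_in)
    for k in range(n_grid + 1):
        alpha = k / n_grid
        scales = [max(m, 1e-8) ** alpha for m in act_mag]
        err = output_error(weights, calib, scales, n_bits)
        if err < best[0]:
            best = (err, alpha, scales)
    return best

# Synthetic layer (assumption): 8 outputs, 16 input channels, channel 3 has large activations.
rng = random.Random(0)
n_out, n_in = 8, 16
weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
calib = []
for _ in range(32):
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    x[3] *= 20.0                                      # the "salient" channel
    calib.append(x)

rtn_err = output_error(weights, calib, [1.0] * n_in)
best_err, best_alpha, best_scales = search_awq_scales(weights, calib)
print(f"RTN error: {rtn_err:.4f}  scaled error: {best_err:.4f}  alpha: {best_alpha:.2f}")
```

Because alpha = 0 is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration data. A real implementation would also use smaller quantization groups and fold the inverse scale into the preceding layer rather than dividing activations at runtime.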

Related: GPTQ, Quantization, Chapter 5: Techniques

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
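As a rough illustration of the VRAM saving, the sketch below computes weight storage at different precisions. The 7-billion-parameter model size is a hypothetical example, and the figures cover weights only: they ignore quantization metadata (scales and zero points), the KV cache, and activations.

```python
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gib(n_params, bits):
    """Weight storage in GiB for n_params parameters at the given bit width."""
    return n_params * bits / 8 / 2 ** 30

n_params = 7_000_000_000  # hypothetical 7B-parameter model
footprint = {fmt: weight_memory_gib(n_params, bits) for fmt, bits in BITS_PER_WEIGHT.items()}

for fmt, gib in footprint.items():
    print(f"{fmt}: {gib:.2f} GiB")
```

Under these assumptions FP16 weights take about 13 GiB, INT8 half of that, and INT4 a quarter.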

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.
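The two formats trade range for precision, which the sketch below makes concrete by computing the largest finite value of each from its bit layout. It assumes the common FP8 convention in which E5M2 reserves the top exponent code for infinities and NaN (IEEE-style), while E4M3 has no infinities and reserves only the all-ones bit pattern for NaN. The function name `fp8_max` is hypothetical.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of a small float format, derived from its bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN; the mantissa can be all ones.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # Top exponent code is usable; only the all-ones mantissa there is NaN.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 - 2 ** (1 - man_bits)
    return max_mantissa * 2 ** (max_exp_field - bias)

e4m3_max = fp8_max(exp_bits=4, man_bits=3, ieee_like=False)
e5m2_max = fp8_max(exp_bits=5, man_bits=2, ieee_like=True)
print(e4m3_max, e5m2_max)  # 448.0 57344.0
```

E4M3 spends its extra mantissa bit on precision and tops out at 448, while E5M2 reaches 57344 with coarser steps.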
