AWQ (Activation-aware Weight Quantization)
Weight-only quantization method that protects the roughly 1% of weight channels most salient to output quality, enabling accurate 4-bit inference.
Definition
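AWQ (Lin et al., 2023) is a post-training, weight-only quantization method, typically used at INT4. It identifies the most salient weight channels from activation magnitudes measured on a small calibration set, rather than from the weights themselves. It then scales those channels up before quantizing and applies the inverse scale to the matching activations, which keeps the layer output mathematically equivalent while reducing the relative quantization error on the salient channels. Because only weights (not activations) are quantized, no activation quantization is needed at runtime, and because the method needs no backpropagation or retraining, AWQ is fast to apply. AWQ-quantized models run on consumer GPUs through the AutoAWQ library or serving engines such as vLLM, and are widely distributed on Hugging Face.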
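The scaling idea in the definition can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the AutoAWQ implementation: the helper names (`quantize_int4_groupwise`, `awq_search`), the group size of 32, the 21-point grid over the exponent `alpha`, and the synthetic data are all illustrative choices. It searches for per-input-channel scales of the form `mean|x| ** alpha`, quantizes the scaled weights with plain round-to-nearest, and compares the output error against quantizing the unscaled weights.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    """Asymmetric round-to-nearest INT4 fake-quantization.

    w has shape (out_features, in_features). Each row is split into groups of
    `group_size` input channels that share one scale and zero point. The
    dequantized float weights are returned so errors can be measured directly.
    """
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0  # 4 bits -> levels 0..15
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, 15)
    return ((q - zero) * scale).reshape(out_f, in_f)

def awq_search(w, x, n_grid=20, group_size=32):
    """Grid-search alpha in [0, 1] for scales s = mean|x| ** alpha.

    Scaling weights by s and activations by 1/s leaves x @ w.T unchanged in
    full precision, so the effective quantized weight is Q(w * s) / s.
    alpha = 0 gives s = 1, i.e. plain round-to-nearest.
    """
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)  # (in_features,)
    ref = x @ w.T                                       # full-precision output
    best = None
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())              # keep scales centered around 1
        w_q = quantize_int4_groupwise(w * s, group_size) / s
        err = np.mean((x @ w_q.T - ref) ** 2)
        if best is None or err < best[0]:
            best = (err, alpha, s)
    return best

# Synthetic calibration data (an assumption): two input channels carry much
# larger activations, which makes the matching weight columns "salient".
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))
x[:, :2] *= 30.0
w = rng.normal(size=(64, 128))

err_rtn = np.mean((x @ quantize_int4_groupwise(w).T - x @ w.T) ** 2)
err_awq, alpha, s = awq_search(w, x)
print(f"round-to-nearest MSE: {err_rtn:.4f}")
print(f"activation-aware MSE: {err_awq:.4f} (alpha={alpha:.2f})")
```

Because `alpha = 0` is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration data. In a real deployment the inverse scale is usually folded into the preceding operation instead of being applied to the activations at runtime.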
Related
More Optimization terms
Quantization
Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
INT8
8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
FP8
8-bit floating-point format (E4M3 or E5M2) natively supported on GPUs such as the H100/H200, enabling faster matmuls with minimal accuracy loss vs. FP16.
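The memory savings mentioned in these entries follow from simple arithmetic on bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weight storage only; it ignores quantization scales, zero points, activations, and the KV cache, so real footprints are somewhat larger.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"FP16": 16, "BF16": 16, "INT8": 8, "FP8": 8, "INT4": 4}

def weight_memory_gb(n_params, fmt):
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

n_params = 7e9  # hypothetical model size, chosen only for illustration
for fmt in ("FP16", "INT8", "FP8", "INT4"):
    print(f"{fmt}: {weight_memory_gb(n_params, fmt):.1f} GB")
```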