GPTQ
Layer-wise post-training weight quantization that uses approximate second-order (Hessian) information to minimize quantization error, typically at 4-bit or 8-bit precision.
Definition
GPTQ (Frantar et al., 2022) is a one-shot post-training quantization algorithm: it needs no retraining, only a single pass over a small calibration dataset. It quantizes the weights of each layer one column at a time and, after each step, applies a closed-form update to the remaining unquantized weights to compensate for the rounding error just introduced. The size of the correction comes from the inverse Hessian of the layer's squared output error, which is estimated from the layer's inputs on the calibration data. GPTQ typically stays close to FP16 quality at INT4 for large models and is the basis for several popular quantized model families available on Hugging Face.
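The column-by-column update can be illustrated with a minimal NumPy sketch for a single linear layer. It is a simplification, not the reference implementation. It assumes a symmetric per-row 4-bit grid, a Hessian of the form H = 2XXᵀ plus a small damping term, and no blocking, grouping, or column reordering. The names `gptq_quantize`, `quantize_to_grid`, and `damp` are chosen here for illustration only.

```python
import numpy as np

def quantize_to_grid(w, scale, bits=4):
    """Round to the nearest point on a symmetric signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Sketch of GPTQ for one layer.

    W: (d_out, d_in) weights. X: (d_in, n) calibration inputs.
    Returns the quantized weights and the per-row scales.
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    # Assumption: one scale per output row, fixed before any updates.
    scale = np.abs(W).max(axis=1) / qmax
    # Hessian of the squared output error ||WX - QX||^2 w.r.t. one row of W.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # damping for stability
    # Upper Cholesky factor U of the inverse Hessian (H^-1 = U^T U).
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = quantize_to_grid(W[:, i], scale, bits)
        # Scaled rounding error of column i ...
        err = (W[:, i] - Q[:, i]) / U[i, i]
        # ... pushed onto the not-yet-quantized columns in closed form.
        W[:, i:] -= np.outer(err, U[i, i:])
    return Q, scale

# Toy comparison against plain round-to-nearest (RTN) on correlated inputs.
rng = np.random.default_rng(0)
d_out, d_in, n = 8, 16, 256
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, d_in)) @ rng.normal(size=(d_in, n))

Q_gptq, scale = gptq_quantize(W, X)
Q_rtn = quantize_to_grid(W, scale[:, None])

err_gptq = np.linalg.norm(W @ X - Q_gptq @ X) ** 2
err_rtn = np.linalg.norm(W @ X - Q_rtn @ X) ** 2
print(f"output error  RTN: {err_rtn:.2f}  GPTQ: {err_gptq:.2f}")
```

Both methods use the same grid, so any difference in output error comes from the error compensation alone. Production implementations additionally process columns in blocks and often use per-group scales, which this sketch leaves out.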
Related
More Optimization terms
Quantization
Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
INT8
8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
FP8
8-bit floating-point format (E4M3 or E5M2) natively supported on recent GPUs such as the H100/H200, enabling faster matmuls with minimal accuracy loss vs. FP16.