Optimization

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

Definition

FP8 is a family of 8-bit floating-point formats (E4M3 and E5M2, defined in the OCP MX specification) that NVIDIA H100 and H200 GPUs execute natively in Tensor Core operations. Unlike INT8, FP8 maintains a floating-point dynamic range, making it more robust to activation outliers. H100 FP8 Tensor Cores deliver roughly 2× the TFLOPS of BF16, making FP8 a compelling option for training and serving. Frameworks such as TensorRT-LLM, vLLM, and SGLang all support FP8 quantization for supported models.

INT8 Tensor Core Chapter 3: Hardware

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

AWQ (Activation-aware Weight Quantization)

Weight-only quantization method that protects the 1% of channels salient for output quality, enabling accurate 4-bit inference.

Back to Glossary Start Reading — Chapter 0

FP8

Definition

Related

More Optimization terms

Quantization

INT8

AWQ (Activation-aware Weight Quantization)