Optimization

Draft Model

Small, fast auxiliary model used in speculative decoding to propose candidate tokens for the larger target model to verify.

Definition

A draft model is a smaller, cheaper model paired with a larger target model in speculative decoding. It generates K speculative tokens autoregressively, which the target model then verifies in a single batched forward pass. The draft is often a pruned or distilled version of the target, or a smaller model from the same family (e.g., Llama 3 8B drafting for Llama 3 70B). For the technique to pay off, the draft must be both fast and accurate: each cycle costs K draft steps plus one target step, so it beats plain decoding only when enough draft tokens are accepted per cycle to outweigh that cost.
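The draft-then-verify cycle and the break-even condition can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it uses greedy token matching in place of the probabilistic acceptance test that sampling-based speculative decoding uses, and the names `speculative_step`, `estimated_speedup`, `alpha` (per-token acceptance rate), and `draft_cost` (cost of one draft step relative to one target step) are hypothetical. The speedup estimate also assumes that tokens are accepted independently and that verifying K tokens costs about the same as one target step.

```python
from typing import Callable, List

# Hypothetical interface: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify cycle and return the tokens it emits."""
    # 1) The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2) The target checks every proposed position. A real system scores all
    #    k positions in one batched forward pass; this loop only mimics that.
    emitted: List[int] = []
    for token in proposed:
        expected = target_next(prefix + emitted)
        if token != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(token)

    # 3) Every draft token matched, so the target adds one more token.
    emitted.append(target_next(prefix + emitted))
    return emitted


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding, with one target step costing 1.

    Assumes each draft token is accepted independently with probability alpha
    and that verifying k tokens costs the same as one target step.
    """
    if alpha >= 1.0:
        tokens_per_cycle = float(k + 1)
    else:
        tokens_per_cycle = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    return tokens_per_cycle / (k * draft_cost + 1.0)


# Toy models over digits 0-9: the target counts upward, the draft errs after a 3.
def target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([1], draft, target, k=4))  # [2, 3, 4]
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05), 2))
print(round(estimated_speedup(alpha=0.2, k=4, draft_cost=0.3), 2))
```

In the toy run the draft's third proposal is wrong, so the cycle emits two accepted tokens plus the target's correction. Under the stated assumptions, a cheap draft with a high acceptance rate (`alpha=0.8`, `draft_cost=0.05`) comes out at roughly 2.8x, while a slow draft with a low acceptance rate (`alpha=0.2`, `draft_cost=0.3`) drops below 1x, meaning speculation would be slower than plain decoding.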

Related: Speculative Decoding, EAGLE

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.
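The memory savings quoted in the entries above follow from bytes-per-parameter arithmetic. The sketch below is a rough estimate only: the helper name `weight_memory_gb` and the 70-billion-parameter example are illustrative assumptions, and the figures cover weights alone, ignoring quantization scales, activations, and the KV cache.

```python
# Storage per parameter for each format, in bytes (INT4 packs two values per byte).
BYTES_PER_PARAM = {"FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Hypothetical 70-billion-parameter model.
for fmt in ("FP16", "INT8", "FP8", "INT4"):
    print(fmt, weight_memory_gb(70e9, fmt), "GB")
```

For this example the estimate is 140 GB at FP16, 70 GB at INT8 or FP8, and 35 GB at INT4, which is where the "roughly halving memory vs. FP16" figure for 8-bit formats comes from.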
