Optimization

Speculative Decoding

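Latency technique where a small draft model proposes token sequences that the target model verifies in parallel, typically cutting per-token decode latency by 2–3×. Time to first token (TTFT) is largely unaffected, since it is dominated by prefill.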

Definition

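Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) uses a small, fast draft model to generate K candidate tokens, then runs the larger target model on all K+1 positions in a single parallel forward pass to verify them. Tokens are checked in order: accepted tokens are kept, the first rejected token and everything after it are discarded, and one corrected token is sampled from the target model in its place. If all K candidates are accepted, the same pass yields one extra token, so each step emits between 1 and K+1 tokens. Because the acceptance rule is a form of rejection sampling, the output distribution matches that of the target model decoding alone. Accepted tokens require no target-model compute beyond the verification pass, so generation latency improves substantially when the draft acceptance rate is high (typically 70–80% on aligned tasks).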
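The verification step can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the draft and target distributions are already available as lists of probabilities. The names `speculative_step` and `sample` are hypothetical and do not come from any library. The accept/resample rule follows the rejection-sampling scheme described in the cited papers.

```python
import random

def sample(weights, rng):
    """Draw an index in proportion to non-negative weights (need not sum to 1)."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One verification step (illustrative sketch, not a library API).

    draft_probs[i]  : draft distribution q_i that draft_tokens[i] was sampled from
    target_probs[i] : target distribution p_i at the same position; there are
                      K+1 of these, the last one scoring the position after
                      all K draft tokens
    Returns the tokens emitted this step (between 1 and K+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # All K accepted: take a bonus token from the target's last distribution.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    q = [[0.5, 0.5, 0.0]] * 3   # toy draft distributions, K = 3
    p = [[0.4, 0.4, 0.2]] * 4   # toy target distributions, K + 1 = 4
    print(speculative_step(q, p, [0, 1, 0], rng))
```

In a real system the K+1 target distributions come from one batched forward pass over the drafted tokens, which is where the latency saving originates. The loop above only decides how many of those tokens to keep.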

Related: Draft Model, EAGLE, Medusa, Chapter 5: Techniques

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.
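To make the quantization entries above concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in plain Python. It assumes one scale for the whole tensor and a clamp to [-127, 127]. Production toolchains usually work per channel or per group and add calibration. The function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [qi * scale for qi in q]

if __name__ == "__main__":
    weights = [0.5, -1.27, 0.01, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    # Rounding error per element is at most half a quantization step.
    print(q, scale, max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored in one byte instead of two, which is the source of the roughly 2× memory saving relative to FP16. The rounding error per element is bounded by half of `scale`.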
