Optimization

EAGLE

Speculative decoding variant with a lightweight head on target-model hidden states, achieving higher draft acceptance rates than token-level draft models.

Definition

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) replaces the separate token-level draft model with a single transformer decoder layer that autoregressively predicts the target model's hidden-state features, conditioned on earlier features and on the embeddings of already-sampled tokens. The target model's own LM head then maps each predicted feature to draft tokens. Because it drafts in the continuous feature space, it achieves higher draft acceptance rates than token-level methods, with reported speedups of roughly 3–4× on generation benchmarks. EAGLE-2 further improves acceptance by building the speculative draft tree dynamically, based on context, instead of using a fixed tree shape.
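The draft-then-verify loop described above can be sketched with a toy model. This is a minimal illustration under loud assumptions, not the EAGLE implementation: the "target model" is a single random matrix `W_target`, the "draft head" is a perturbed copy `W_draft`, decoding is greedy, drafts form a chain rather than a tree, and all names (`step`, `speculative_greedy`, and so on) are hypothetical. It shows the two properties the definition relies on: the draft works on features and reuses the target's LM head, and verification keeps the output identical to target-only decoding.

```python
import numpy as np

VOCAB, DIM = 16, 8
rng = np.random.default_rng(0)
embed = rng.normal(size=(VOCAB, DIM))        # toy token embeddings
lm_head = rng.normal(size=(DIM, VOCAB))      # LM head shared by target and draft
W_target = rng.normal(size=(2 * DIM, DIM))   # stand-in for the full target model
W_draft = W_target + 0.05 * rng.normal(size=W_target.shape)  # imperfect draft head


def step(W, feature, token):
    """Next feature from the previous feature and the latest token's embedding."""
    return np.tanh(np.concatenate([feature, embed[token]]) @ W)


def greedy_token(feature):
    """Read a token off a feature through the shared LM head."""
    return int(np.argmax(feature @ lm_head))


def target_greedy(feature, token, n):
    """Reference: plain greedy decoding with the target only."""
    out = []
    for _ in range(n):
        feature = step(W_target, feature, token)
        token = greedy_token(feature)
        out.append(token)
    return out


def speculative_greedy(feature, token, n, k=4):
    """Draft k tokens at feature level, then verify them against the target."""
    out, drafted, accepted = [], 0, 0
    while len(out) < n:
        # Draft phase: extrapolate k features with the cheap head.
        d_feat, d_tok, draft = feature, token, []
        for _ in range(k):
            d_feat = step(W_draft, d_feat, d_tok)
            d_tok = greedy_token(d_feat)
            draft.append(d_tok)
        drafted += k
        # Verify phase: sequential here; a real system scores all drafts
        # in one batched target pass. No bonus token is emitted, for simplicity.
        for d in draft:
            feature = step(W_target, feature, token)
            token = greedy_token(feature)
            out.append(token)          # the target's token is always the one kept
            if token != d:
                break                  # first mismatch: discard the remaining drafts
            accepted += 1
    return out[:n], accepted / drafted


tokens, rate = speculative_greedy(np.zeros(DIM), 0, 20)
print(tokens == target_greedy(np.zeros(DIM), 0, 20), f"acceptance={rate:.2f}")
```

Because every emitted token comes from the target's own prediction, the draft head only affects how many target steps can be confirmed per round, never the output. A more accurate draft (smaller perturbation in this toy) raises the acceptance rate, which is where the speedup comes from.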

Related

Speculative Decoding, Medusa

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.
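As a rough illustration of the quantization entries above, the following sketch applies symmetric per-tensor INT8 quantization to a weight matrix with NumPy. It is one simple scheme among many (production libraries usually use per-channel or per-group scales), and the helper names `quantize_int8` and `dequantize` are hypothetical. It assumes the tensor is not all zeros.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = float(np.max(np.abs(w))) / 127.0   # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

fp16_bytes = w.astype(np.float16).nbytes
max_err = float(np.max(np.abs(dequantize(q, scale) - w)))
print(f"INT8/FP16 size ratio: {q.nbytes / fp16_bytes}")
print(f"max abs error: {max_err:.5f} (half a step is {scale / 2:.5f})")
```

The stored tensor takes one byte per weight instead of two, and the rounding error per weight is bounded by about half a quantization step, which is the source of the small accuracy degradation mentioned above.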
