Optimization

Medusa

Speculative decoding approach adding multiple prediction heads to the base model, each independently predicting tokens K steps ahead.

Definition

Medusa attaches several lightweight feed-forward heads directly to the target model's top layer, each trained to predict the token at a different future position (head i predicts token t+i). At inference time, all heads run in parallel to propose a tree of candidate continuations, and a tree-attention verification pass accepts or rejects branches. Because no separate draft model is needed, Medusa is simpler to deploy than traditional speculative decoding, though it requires a brief fine-tuning step to train the extra heads.

Speculative Decoding EAGLE

More Optimization terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

Back to Glossary Start Reading — Chapter 0

Medusa

Definition

Related

More Optimization terms

Quantization

INT8

FP8