Medusa
Speculative decoding approach that adds multiple prediction heads to the base model, each predicting the token at a different future position.
Definition
Medusa attaches several lightweight feed-forward heads to the target model's final hidden state, alongside the original language-model head. Each head is trained to predict the token at a different future position: the original head predicts token t+1, and Medusa head i predicts token t+i+1. At inference time, all heads run in parallel on the same hidden state, and their top candidates are combined into a tree of candidate continuations. A single verification pass with tree attention scores every branch at once, and the longest accepted branch is kept. Because no separate draft model is needed, Medusa is simpler to deploy than draft-model speculative decoding, though it requires a fine-tuning step to train the extra heads, either with the base model frozen or jointly with it.
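The propose-then-verify loop can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration, not the actual Medusa implementation: `base_next` is a hypothetical stand-in for the base model's greedy next-token choice, `medusa_heads` is a set of hand-written stand-ins for trained heads (the base model's own head supplies token t+1, and the extra heads cover the positions after it), and verification uses a simple greedy match. A real system scores all branches in one batched forward pass with tree attention, whereas this sketch checks them one by one.

```python
from itertools import product

VOCAB = 10  # toy vocabulary of token ids 0..9 (assumption)


def base_next(tokens):
    """Hypothetical stand-in for the base model's greedy next-token choice."""
    return (3 * tokens[-1] + 1) % VOCAB


def medusa_heads(tokens):
    """Hand-written stand-ins for trained heads. Each returns its top-2
    guesses for one future position, using only the last token."""
    x = tokens[-1]
    return [
        [(9 * x + 4) % VOCAB, (x + 1) % VOCAB],    # position t+2: right at top-1
        [(x + 2) % VOCAB, (27 * x + 13) % VOCAB],  # position t+3: right at top-2
        [(x + 5) % VOCAB, (x + 7) % VOCAB],        # position t+4: deliberately wrong
    ]


def medusa_step(tokens):
    """Propose a candidate tree, verify it, and keep the longest accepted branch."""
    first = base_next(tokens)  # the base head supplies token t+1
    best = [first]
    # Each combination of head candidates is one root-to-leaf path of the tree.
    for path in product(*medusa_heads(tokens)):
        accepted = [first]
        for tok in path:
            # Greedy acceptance: keep a token only if the base model agrees.
            if base_next(tokens + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best


def medusa_generate(prompt, n_new):
    """Generate n_new tokens and report how many decoding steps were needed."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < n_new:
        tokens += medusa_step(tokens)
        steps += 1
    return tokens[: len(prompt) + n_new], steps


def greedy_generate(prompt, n_new):
    """Plain one-token-per-step greedy decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(base_next(tokens))
    return tokens
```

Under greedy acceptance the output is identical to ordinary greedy decoding. The only difference is that each step can commit more than one token, so fewer steps are needed when the heads guess well.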
Related
More Optimization terms
Quantization
Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
INT8
8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
FP8
8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.