Skip to content
Optimization

Medusa

Speculative decoding approach adding multiple prediction heads to the base model, each independently predicting tokens K steps ahead.

Definition

Medusa attaches several lightweight feed-forward heads directly to the target model's top layer, each trained to predict the token at a different future position (head i predicts token t+i). At inference time, all heads run in parallel to propose a tree of candidate continuations, and a tree-attention verification pass accepts or rejects branches. Because no separate draft model is needed, Medusa is simpler to deploy than traditional speculative decoding, though it requires a brief fine-tuning step to train the extra heads.

More Optimization terms