Disaggregated Inference
Architecture separating prefill and decode work onto different GPU pools, allowing each phase to be independently scaled and optimised.
Definition
Prefill is compute-bound (large matrix multiplies over the full prompt), while decode is memory-bandwidth-bound (model weights and the growing KV cache are streamed from memory for every generated token). Disaggregated inference runs these two phases on separate GPU pools connected by a fast interconnect: typically prefill on a small number of high-FLOP machines and decode on a larger pool of memory-bandwidth-rich machines. KV cache blocks computed by the prefill nodes are transferred to the decode nodes over a link such as NVLink or InfiniBand, so the decode pool can continue generation without recomputing the prompt. This separation lets each phase be right-sized independently, improving utilisation and cost efficiency at scale.
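The reasoning above can be made concrete with a back-of-envelope sketch. It estimates how much arithmetic each byte of weights supports in prefill versus decode, and how large the KV cache handoff is for one prompt. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 50 GB/s link are hypothetical figures chosen for illustration. The intensity formula is a simplification that assumes about 2 FLOPs per parameter per token and ignores attention and activation traffic.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Rough FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each weight is read
    from memory once per pass (a simplification, not a measured figure).
    """
    return 2 * tokens_per_pass / bytes_per_param


def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes: int, link_gbytes_per_s: float) -> float:
    """Idealised transfer time, ignoring latency and protocol overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9)


# Prefill processes the whole prompt in one pass; decode processes one token.
prefill_intensity = arithmetic_intensity(tokens_per_pass=4096)  # 4096.0 FLOPs/byte
decode_intensity = arithmetic_intensity(tokens_per_pass=1)      # 1.0 FLOPs/byte

# KV cache that a prefill node would hand to a decode node (hypothetical model).
handoff = kv_cache_bytes(num_tokens=4096, num_layers=32,
                         num_kv_heads=8, head_dim=128)          # 536870912 bytes
seconds = transfer_seconds(handoff, link_gbytes_per_s=50.0)     # hypothetical link

print(f"prefill intensity: {prefill_intensity:.0f} FLOPs/byte")
print(f"decode intensity:  {decode_intensity:.0f} FLOPs/byte")
print(f"KV handoff: {handoff / 2**20:.0f} MiB, about {seconds * 1e3:.1f} ms")
```

Under these assumptions the gap in intensity is what makes the two phases favour different hardware, and the handoff size shows why the interconnect between pools matters. Batching several decode requests together raises the decode intensity, which the sketch models by passing a larger `tokens_per_pass`.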
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
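As a rough illustration of how the related terms interact, the sketch below compares per-token KV cache size under MHA and GQA. The layer count, head counts, and head dimension are hypothetical values, not taken from any particular model. A smaller per-token cache also means less data to move from prefill to decode nodes in a disaggregated setup.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache bytes stored per token: one K and one V vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical model: 32 layers, 32 query heads, head dimension 128, 16-bit values.
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM = 32, 32, 128

# MHA keeps one K/V head per query head.
mha = kv_bytes_per_token(NUM_LAYERS, num_kv_heads=NUM_QUERY_HEADS, head_dim=HEAD_DIM)

# GQA shares each K/V head across a group of query heads (here, groups of 4).
gqa = kv_bytes_per_token(NUM_LAYERS, num_kv_heads=NUM_QUERY_HEADS // 4, head_dim=HEAD_DIM)

print(f"MHA: {mha // 1024} KiB per token")  # 512 KiB
print(f"GQA: {gqa // 1024} KiB per token")  # 128 KiB
print(f"reduction: {mha // gqa}x")          # 4x
```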