Architecture

Disaggregated Inference

Architecture separating prefill and decode work onto different GPU pools, allowing each phase to be independently scaled and optimised.

Definition

Prefill is typically compute-bound: it processes the full prompt in one pass using large matrix multiplies. Decode is typically memory-bandwidth-bound: each generated token requires streaming the model weights, and the growing KV cache, from memory. Disaggregated inference runs these two phases on separate GPU pools connected by a fast interconnect, with prefill on compute-rich (high-FLOPS) machines and decode on memory-bandwidth-rich machines. The prefill pool is often the smaller of the two, though the right ratio depends on the workload's mix of prompt and output lengths. KV cache blocks computed by the prefill nodes are transferred to the decode nodes over NVLink or InfiniBand. This separation lets each phase be right-sized, improving utilisation and cost efficiency at scale.
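The compute-bound versus bandwidth-bound distinction, and the cost of moving the KV cache between pools, can be illustrated with a rough back-of-the-envelope sketch. Everything here is an assumption made for illustration: the function names, the model shape, the hardware figures, and the simplified cost model (about 2 FLOPs per weight per token, weights read once per forward pass, attention and activation traffic ignored).

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: int = 2) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per weight per token and that weights are read once per
    pass, so the parameter count cancels out: (2 * P * T) / (P * b) = 2T / b.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


def is_compute_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Roofline-style check: compare intensity with the machine balance point."""
    return intensity >= peak_flops / mem_bandwidth


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   prompt_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one prompt: a K and a V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * prompt_tokens * bytes_per_value


def transfer_seconds(n_bytes: int, link_gbytes_per_s: float) -> float:
    """Idealised transfer time over a link, ignoring latency and protocol overhead."""
    return n_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS, MEM_BW = 1e15, 3e12

prefill = arithmetic_intensity(tokens_per_pass=2048)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one token, batch size 1

print(is_compute_bound(prefill, PEAK_FLOPS, MEM_BW))  # True
print(is_compute_bound(decode, PEAK_FLOPS, MEM_BW))   # False

# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, prompt_tokens=2048)
print(kv / 2**20)                  # 256.0 (MiB)
print(transfer_seconds(kv, 50.0))  # about 0.0054 s on an assumed 50 GB/s link
```

Under these assumptions a 2048-token prefill sits well above the machine balance point while single-token decode sits far below it, which is the asymmetry that motivates separate pools. Batching decode requests raises its intensity, so real deployments land somewhere between the two extremes.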
