Architecture

Disaggregated Inference

Architecture separating prefill and decode work onto different GPU pools, allowing each phase to be independently scaled and optimised.

Definition

Prefill is compute-bound (large matrix multiplies over the full prompt), while decode is memory-bandwidth-bound (model weights and the growing KV cache are streamed from memory for each generated token). Disaggregated inference runs these two phases on separate GPU pools connected by a fast interconnect: prefill on a small number of high-FLOP machines, and decode on a larger pool of memory-bandwidth-rich machines. KV cache blocks computed on the prefill nodes are transferred to the decode nodes over NVLink or InfiniBand. This separation lets each pool be sized for its own bottleneck, improving utilisation and cost efficiency at scale.
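The compute-bound versus bandwidth-bound split can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved) for a single linear layer. This is a minimal sketch under simplifying assumptions: one square `d_model x d_model` weight matrix, 2-byte (fp16) elements, and no attention or KV cache traffic. The function name and the example sizes are illustrative, not taken from any particular serving system.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte for one (d_model x d_model) linear layer.

    `tokens` is the number of rows pushed through the layer at once:
    the prompt length during prefill, or 1 per sequence during decode.
    """
    # A matrix multiply costs 2 FLOPs (multiply + add) per weight per token.
    flops = 2 * tokens * d_model * d_model
    # The weights are read once regardless of how many tokens are processed.
    weight_bytes = d_model * d_model * bytes_per_elem
    # Activations: read the input rows and write the output rows.
    activation_bytes = 2 * tokens * d_model * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


# Assumed example sizes: a 2048-token prompt versus a single decode step.
prefill_intensity = arithmetic_intensity(tokens=2048, d_model=4096)
decode_intensity = arithmetic_intensity(tokens=1, d_model=4096)

print(f"prefill: {prefill_intensity:.1f} FLOPs/byte")
print(f"decode:  {decode_intensity:.3f} FLOPs/byte")
```

Under these assumptions prefill performs roughly a thousand times more arithmetic per byte moved than a single-sequence decode step, which is why the two phases favour different hardware. Batching many sequences together raises the decode figure, but the KV cache reads (ignored here) grow with it.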

Related terms: Prefill Phase, Decode Phase, Continuous Batching

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
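To give a sense of scale, the size of the KV cache for one sequence can be estimated from the model shape. The sketch below assumes a plain decoder-only Transformer that stores one key and one value vector per layer, per KV head, per token; the function name and the example dimensions are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,  # assumed fp16/bf16 storage
) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed example shape: 32 layers, 32 KV heads of dimension 128, 4096 tokens.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1024**3:.1f} GiB per sequence")
```

With these assumed dimensions the cache comes to 2 GiB for a single 4096-token sequence, which is the amount of data that would otherwise have to be recomputed at every decode step.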

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
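The sharing pattern can be sketched as a mapping from each query head to the K/V head it reads. This is an illustrative sketch that assumes consecutive query heads are grouped together and that the number of query heads is a multiple of the number of K/V heads; the function name is hypothetical.

```python
def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Return the index of the K/V head that a given query head attends with."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    if not 0 <= query_head < n_query_heads:
        raise ValueError("query_head out of range")
    # Each group of `group_size` consecutive query heads shares one K/V head.
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Assumed example: 8 query heads sharing 2 K/V heads (groups of 4).
mapping = [kv_head_for(h, n_query_heads=8, n_kv_heads=2) for h in range(8)]
print(mapping)
```

Setting `n_kv_heads` equal to `n_query_heads` recovers MHA (every query head has its own K/V head), while `n_kv_heads=1` corresponds to multi-query attention. Since only K and V are cached, the KV cache shrinks by the group size, a factor of 4 in this example.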
