Architecture

Pipeline Parallelism

Multi-device strategy that assigns consecutive groups of Transformer layers to different GPUs or nodes, passing activations from one stage to the next over the interconnect.

Definition

Pipeline parallelism (PP) divides the Transformer's layers into contiguous groups called stages, each assigned to a different GPU or node. During inference, activations flow sequentially from stage 0 through stage N-1, with inter-stage communication over NVLink (within a node) or InfiniBand (across nodes). PP needs only a point-to-point activation transfer (send/receive) at each stage boundary, not a collective operation. That is far less communication per device than tensor parallelism (TP), which requires all-reduces inside every layer. The cost is pipeline bubbles: idle time while a stage waits for its upstream stage to produce activations. PP is typically combined with TP for models spanning many nodes.
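The stage assignment and the bubble cost described above can be sketched in a few lines of Python. This is an illustrative sketch, not any framework's API: the function names `partition_layers` and `bubble_fraction` are hypothetical, and the bubble estimate assumes a simple fill-and-drain schedule with equal-cost stages and the batch split into micro-batches, which is an assumption not stated in this entry.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous stages.

    Layers are spread as evenly as possible; when the split is uneven,
    the earliest stages each take one extra layer.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction per stage under a fill-and-drain schedule.

    Assumes every stage takes the same time per micro-batch, so the whole
    pipeline needs (num_microbatches + num_stages - 1) time slots and each
    stage is busy for num_microbatches of them.
    """
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for stage_id, layers in enumerate(partition_layers(32, 4)):
        print(f"stage {stage_id}: layers {layers.start}..{layers.stop - 1}")
    print(bubble_fraction(4, 1))   # single batch, no micro-batching
    print(bubble_fraction(4, 12))  # 12 micro-batches
```

Under these assumptions, a 4-stage pipeline processing one batch at a time leaves each stage idle 75% of the time, while splitting the work into 12 micro-batches brings that down to 20%. Real schedulers may interleave work differently, so treat the formula as a first-order estimate.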

Related terms: Tensor Parallelism, Expert Parallelism

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
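To make the KV cache saving concrete, here is a rough size estimate comparing MHA and GQA. The helper `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) are hypothetical values chosen for illustration, not figures from this glossary.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # assumes 16-bit storage such as fp16 or bf16
) -> int:
    """Estimate KV cache size: one K and one V tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096 tokens.
    mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
    # Same model with GQA: 8 K/V heads, each shared by a group of 4 query heads.
    gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
    print(mha / 1024**3, "GiB with MHA")
    print(gqa / 1024**3, "GiB with GQA")
```

For this hypothetical configuration the cache shrinks from 2 GiB to 0.5 GiB per sequence, in direct proportion to the reduction in K/V heads.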
