
Pipeline Parallelism

Multi-GPU strategy that assigns consecutive groups of Transformer layers to different GPUs or nodes, passing activations from one stage to the next over the interconnect.

Definition

Pipeline parallelism (PP) divides the Transformer's layers into contiguous groups called stages, each assigned to a different GPU or node. During inference with N stages, activations flow sequentially from stage 0 through stage N-1, with inter-stage communication over NVLink (within a node) or InfiniBand (across nodes). PP needs only one point-to-point activation transfer at each stage boundary, whereas tensor parallelism (TP) performs all-reduce operations inside every layer, so PP generally requires less communication per device. The cost is pipeline bubbles: idle time in which a stage waits for its upstream stages to produce activations. PP is typically combined with TP for models spanning many nodes, with TP used inside a node and PP across nodes.
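As a rough illustration of the stage split and the bubble cost, here is a minimal Python sketch. The helper names `partition_layers` and `bubble_fraction` are hypothetical and not taken from any framework. The bubble estimate assumes a simple GPipe-style schedule in which the batch is split into equal microbatches and every stage takes the same time per microbatch. Real schedulers and uneven stages will behave differently.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages.

    Hypothetical helper for illustration: earlier stages receive one extra
    layer when num_layers is not divisible by num_stages.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of each device under the assumed GPipe-style schedule.

    With p stages and m microbatches of equal cost, the pipeline occupies
    m + p - 1 time slots, of which p - 1 are idle on every device.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("both arguments must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # A 32-layer model on 4 stages: 8 consecutive layers per stage.
    for stage_id, layers in enumerate(partition_layers(32, 4)):
        print(f"stage {stage_id}: layers {layers.start}..{layers.stop - 1}")
    # More microbatches in flight shrink the bubble.
    for m in (1, 4, 13):
        print(f"microbatches={m}: bubble={bubble_fraction(4, m):.4f}")
```

Under these assumptions, a single request passing through 4 stages leaves each device idle 75% of the time, which is why pipeline-parallel serving tries to keep several microbatches in flight at once.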
