Pipeline Parallelism
Model-parallel strategy that assigns consecutive groups of Transformer layers to different GPUs or nodes, passing activations between stages over the interconnect.
Definition
Pipeline parallelism (PP) divides the Transformer's layers into contiguous groups called stages, each assigned to a different GPU or node. During inference, activations flow sequentially from stage 0 through stage N-1, with inter-stage communication over NVLink (within a node) or InfiniBand (across nodes). Each stage boundary needs only one point-to-point transfer of activations, so PP communicates far less per device than tensor parallelism (TP), which performs all-reduces inside every layer. The cost is pipeline bubbles: idle time while a stage waits for upstream stages to finish. PP is typically combined with TP for models spanning many nodes, commonly with TP inside each node and PP across nodes.
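The stage assignment and the cost of bubbles can be sketched in a few lines of Python. This is an illustrative sketch, not any framework's API: the helper names `partition_layers` and `bubble_fraction` are hypothetical, and the bubble estimate assumes a simple fill-and-drain schedule in which every stage takes the same time and the batch is split into equal micro-batches (micro-batching is an assumption added here to show how bubbles shrink).

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous stage groups."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # Earlier stages absorb the remainder when the split is uneven.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a fill-and-drain schedule with equal stage times.

    Assumption: the schedule takes (m + p - 1) time slots for m micro-batches
    and p stages, of which (p - 1) are spent filling and draining the pipeline.
    """
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


stages = partition_layers(num_layers=32, num_stages=4)
for i, layers in enumerate(stages):
    print(f"stage {i}: layers {layers.start}-{layers.stop - 1}")

print(f"bubble fraction, 1 micro-batch:  {bubble_fraction(4, 1):.3f}")
print(f"bubble fraction, 8 micro-batches: {bubble_fraction(4, 8):.3f}")
```

Under these assumptions, a single request moving through four stages leaves each stage idle three quarters of the time, while keeping eight micro-batches in flight cuts the idle share to roughly 27%. Real schedules and uneven stage times will shift these numbers.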
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.