Architecture

Tensor Parallelism

Distributing individual weight matrices across multiple GPUs so each GPU computes with a column or row shard, using collective communication (typically an all-reduce) to combine the partial results.

Definition

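Tensor parallelism (TP) shards the large weight matrices of each Transformer layer across N GPUs. For a linear layer Y = XW, W can be split by columns, so that each GPU computes a slice of the output columns and the slices are concatenated (an all-gather), or by rows, so that each GPU multiplies its slice of X by its shard of W to get a full-size partial sum and an all-reduce adds the partial sums. In the common Megatron-LM scheme, a column-split matrix is followed by a row-split one, so each attention or MLP block needs only one all-reduce in the forward pass and one in the backward pass; the number of all-reduces per layer does not grow with N, although each one moves activation-sized data and sits on the critical path. TP cuts the per-GPU memory for the sharded weights by roughly a factor of N and spreads the matrix-multiply work across GPUs. On NVLink-connected GPUs within a node, the all-reduce is fast enough that TP typically scales well up to about 8 GPUs, the usual node size; beyond that, traffic has to cross slower inter-node links and communication becomes a bottleneck.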
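The two ways of sharding Y = XW can be checked on a single machine. The following is a minimal sketch, not a real multi-GPU implementation: the "devices" are plain Python lists, the helper names (`matmul`, `split_columns`, `split_rows`, `column_parallel`, `row_parallel`) are made up for illustration, and the concatenation and summation steps stand in for the all-gather and all-reduce collectives a real framework would issue.

```python
def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, n):
    """Split matrix m into n equal-width column blocks (assumes divisibility)."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]


def split_rows(m, n):
    """Split matrix m into n equal-height row blocks (assumes divisibility)."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]


def column_parallel(x, w, n):
    """Each simulated device holds a column shard of W and the full X.

    Every device produces a slice of the output columns; concatenating
    the slices plays the role of an all-gather.
    """
    partials = [matmul(x, w_shard) for w_shard in split_columns(w, n)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def row_parallel(x, w, n):
    """Each simulated device holds a row shard of W and the matching column slice of X.

    Every device produces a full-size partial sum; adding them
    element-wise plays the role of an all-reduce.
    """
    x_shards = split_columns(x, n)
    w_shards = split_rows(w, n)
    partials = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]
    return [
        [sum(p[r][c] for p in partials) for c in range(len(w[0]))]
        for r in range(len(x))
    ]


# Small integer example so results compare exactly.
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]              # 2 x 4 activations
W = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]              # 4 x 4 weights
N = 2                           # number of simulated devices

reference = matmul(X, W)        # unsharded result
col_result = column_parallel(X, W, N)
row_result = row_parallel(X, W, N)
print(col_result == reference, row_result == reference)
```

Both sharded results match the unsharded product. Note that the column-parallel output is already split by columns, which is exactly the input layout the row-parallel step expects; chaining the two is why one all-reduce per block can suffice.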
