Tensor Parallelism
Splitting individual weight matrices across multiple GPUs so that each GPU computes a column or row shard of the result, with all-reduce operations combining the partial results within each layer.
Definition
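Tensor parallelism (TP) shards the large weight matrices of each Transformer layer across N GPUs. For a linear layer Y = XW, W can be split by columns, so that each GPU computes a slice of Y's columns, or by rows (with X split by columns to match), so that each GPU computes a partial sum of the full Y that an all-reduce then adds together. Megatron-style implementations pair a column-parallel layer with a row-parallel one, which leaves one all-reduce per attention block and one per MLP block in the forward pass: two per Transformer layer, regardless of N. TP cuts the per-GPU memory for the sharded weights by roughly a factor of N and spreads each matrix multiplication over N GPUs, at the cost of putting this communication on the critical path of every layer. On NVLink-connected GPUs within a node, the all-reduce is fast enough that TP typically scales well up to the node size (commonly 8 GPUs); beyond a single node, slower inter-node links usually make communication the bottleneck.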
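The column/row pairing can be illustrated with a minimal single-process sketch in plain Python. It simulates two "GPUs" as loop iterations over a tiny two-layer MLP; all names here (`split_columns`, `split_rows`, `all_reduce_sum`, the example matrices) are hypothetical, and `all_reduce_sum` merely stands in for a real collective such as an NCCL all-reduce.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(w, n):
    """Split matrix w into n equal-width column shards."""
    width = len(w[0]) // n
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(n)]

def split_rows(w, n):
    """Split matrix w into n equal-height row shards."""
    height = len(w) // n
    return [w[i * height:(i + 1) * height] for i in range(n)]

def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices (stand-in for an all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

N_GPUS = 2  # assumed shard count for this toy example
X = [[1, -2, 3], [0, 4, -1]]                        # activations, 2 x 3
W1 = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -2, 1, 1]]  # first layer, 3 x 4
W2 = [[1, 2], [0, 1], [-1, 0], [2, -1]]             # second layer, 4 x 2

# Unsharded reference computation.
reference = matmul(relu(matmul(X, W1)), W2)

# Column-parallel first layer, row-parallel second layer.
w1_shards = split_columns(W1, N_GPUS)
w2_shards = split_rows(W2, N_GPUS)

partials = []
for rank in range(N_GPUS):
    # Each rank owns a slice of the hidden columns, so ReLU needs no communication.
    hidden = relu(matmul(X, w1_shards[rank]))
    # The matching row shard of W2 yields a partial sum of the full output.
    partials.append(matmul(hidden, w2_shards[rank]))

# A single all-reduce combines the partial sums for the whole two-layer block.
tp_output = all_reduce_sum(partials)
```

Because the first layer is split by columns, the element-wise nonlinearity can be applied locally on each shard, and only the row-parallel second layer needs its partial sums combined. This is why the pairing needs one all-reduce per block rather than one per matrix multiplication.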
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.