Architecture

Tensor Parallelism

Distributing individual weight matrices across multiple GPUs so that each GPU computes with a column or row shard, then combining the partial results with a collective operation (typically an all-reduce).

Definition

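Tensor parallelism (TP) shards the large weight matrices of each Transformer layer across N GPUs. For a linear layer Y = XW, W can be split by columns, in which case each GPU computes a slice of Y's columns and the slices are concatenated (an all-gather), or by rows with X split by columns to match, in which case each GPU computes a full-size partial sum and an all-reduce adds the partial sums into the final output. Pairing a column-split layer with a row-split layer, as Megatron-LM-style implementations do, needs only one all-reduce per attention or MLP block in the forward pass, so two per Transformer layer regardless of N (training adds matching all-reduces in the backward pass). TP cuts the per-GPU memory for the sharded weights by roughly a factor of N and spreads each layer's matrix-multiply work over N GPUs, but every all-reduce sits on the critical path of the layer. On NVLink-connected GPUs within a node, the all-reduce is fast enough that TP scales well to 8 GPUs; beyond 8, the traffic typically has to cross slower inter-node links and communication becomes a bottleneck.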
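The sharding arithmetic can be checked on a toy example. The sketch below is a plain-Python simulation, not a real multi-GPU implementation: "devices" are list entries, the helper names (`split_columns`, `all_gather_columns`, `all_reduce_sum`, and so on) are made up for illustration, and the collectives are ordinary functions standing in for what a communication library would do. It assumes the matrix dimensions divide evenly by the number of devices.

```python
# Toy simulation of tensor parallelism with plain Python lists (no GPUs).
# All helper names are hypothetical; collectives are simulated locally.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, n):
    """Split matrix m into n equal column shards."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]

def split_rows(m, n):
    """Split matrix m into n equal row shards."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]

def all_gather_columns(shards):
    """Stand-in for all-gather: concatenate column slices side by side."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]

def all_reduce_sum(partials):
    """Stand-in for all-reduce: element-wise sum of same-shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def relu(m):
    """Element-wise activation, applied independently to each entry."""
    return [[max(0, v) for v in row] for row in m]

N = 2  # number of simulated devices

X = [[1, -2, 3, 0], [0, 1, -1, 2]]                                  # 2 x 4 input
W1 = [[1, 0, 2, -1], [0, 1, -1, 3], [2, -2, 0, 1], [1, 1, 1, -2]]   # 4 x 4 weights
W2 = [[1, 2], [0, -1], [3, 0], [-1, 1]]                             # 4 x 2 weights

# Column-parallel layer: each device holds a column shard of W1 and
# produces a slice of the output columns.
col_outputs = [matmul(X, w) for w in split_columns(W1, N)]
Y_col = all_gather_columns(col_outputs)

# Row-parallel layer: W2 is split by rows and the input by columns, so
# each device produces a full-size partial sum that must be added up.
row_partials = [matmul(x, w)
                for x, w in zip(split_columns(Y_col, N), split_rows(W2, N))]
Y_row = all_reduce_sum(row_partials)

# Paired layers (column split, then row split): the activation is applied
# locally to each column shard, so a single all-reduce at the end suffices.
paired_partials = [matmul(relu(h), w)
                   for h, w in zip(col_outputs, split_rows(W2, N))]
Z_parallel = all_reduce_sum(paired_partials)

# Unsharded reference for comparison.
Z_reference = matmul(relu(matmul(X, W1)), W2)
```

Comparing `Z_parallel` with `Z_reference` shows why the column-then-row pairing is attractive: the intermediate activation never has to be gathered, because an element-wise function commutes with column slicing.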

Related: Pipeline Parallelism, NVLink, Chapter 4: Software

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
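To make the KV cache saving concrete, here is a rough size estimate. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096-token context, fp16 cache) are assumptions chosen for illustration, not figures for any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model with 32 query heads.
# MHA keeps one K/V head per query head; GQA here shares 8 K/V heads.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     seq_len=4096, batch_size=1)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                     seq_len=4096, batch_size=1)

print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of K/V heads, here by a factor of 4.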
