Architecture

Expert Parallelism

Parallelism strategy for MoE models that places different expert FFN modules on different GPUs, routing tokens via all-to-all communication.

Definition

Expert parallelism (EP) is a distribution strategy tailored to Mixture-of-Experts (MoE) models. Each GPU (or group of GPUs) hosts a disjoint subset of the expert FFN modules. The router selects the expert or experts for each token; an all-to-all operation then dispatches the tokens to the GPUs hosting those experts, and a second all-to-all returns the expert outputs to the GPUs the tokens came from. Because any token may be routed to an expert on any GPU, this exchange sits on the critical path of every MoE layer, so EP depends on fast interconnects. EP is typically combined with tensor parallelism for the attention layers. Load balancing across experts is a critical implementation concern: a GPU whose experts receive more tokens than the others becomes a straggler that the rest must wait for.
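The flow described above (route, dispatch, compute, combine) can be illustrated with a minimal single-process sketch. It is not a real distributed implementation: the "GPUs" are plain Python lists, the all-to-all exchanges are simulated by bucketing, routing is assumed to be top-1, and experts are assumed to be laid out contiguously across GPUs. The function name expert_parallel_forward, the toy experts, and the router scores are all hypothetical and chosen only for illustration.

```python
def expert_parallel_forward(tokens, router_scores, experts, num_gpus):
    """Simulate one MoE layer under expert parallelism with top-1 routing.

    tokens        : list of token values (floats stand in for hidden vectors)
    router_scores : one list of per-expert scores for each token
    experts       : list of callables, one per expert
    num_gpus      : number of simulated GPUs; experts are split evenly
    """
    num_experts = len(experts)
    assert num_experts % num_gpus == 0, "assumes an even split of experts"
    experts_per_gpu = num_experts // num_gpus

    # 1. Routing: pick the highest-scoring expert for each token.
    assignments = [
        max(range(num_experts), key=row.__getitem__) for row in router_scores
    ]

    # 2. Dispatch (stands in for the first all-to-all): bucket each token by
    #    the GPU that hosts its expert, remembering its original position.
    inbox = [[] for _ in range(num_gpus)]
    for pos, (tok, exp) in enumerate(zip(tokens, assignments)):
        inbox[exp // experts_per_gpu].append((pos, exp, tok))

    # 3. Local compute: each GPU runs only the experts it hosts.
    outbox = [
        [(pos, experts[exp](tok)) for pos, exp, tok in received]
        for received in inbox
    ]

    # 4. Combine (stands in for the second all-to-all): return each result
    #    to the position its token originally occupied.
    output = [None] * len(tokens)
    for sent_back in outbox:
        for pos, value in sent_back:
            output[pos] = value

    # Tokens received per GPU, a simple view of load balance.
    load = [len(received) for received in inbox]
    return output, load


# Toy setup: 4 experts on 2 GPUs (experts 0-1 on GPU 0, experts 2-3 on GPU 1).
experts = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 1,
    lambda x: x * x,
]
tokens = [1.0, 2.0, 3.0, 4.0]
router_scores = [
    [0.9, 0.1, 0.0, 0.0],  # token 0 -> expert 0 (GPU 0)
    [0.1, 0.2, 0.6, 0.1],  # token 1 -> expert 2 (GPU 1)
    [0.0, 0.0, 0.1, 0.9],  # token 2 -> expert 3 (GPU 1)
    [0.2, 0.5, 0.2, 0.1],  # token 3 -> expert 1 (GPU 0)
]

output, load = expert_parallel_forward(tokens, router_scores, experts, num_gpus=2)
print(output)  # [2.0, 1.0, 9.0, 8.0]
print(load)    # [2, 2]
```

In this toy run each GPU receives two tokens, so the load is balanced. If the router sent every token to experts 0 and 1, GPU 0 would do all the work while GPU 1 sat idle, which is the imbalance the definition warns about.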

Related terms: Mixture of Experts (MoE), Tensor Parallelism

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
