Expert Parallelism

A parallelism strategy for MoE models that places different expert FFN modules on different GPUs and routes tokens to them via all-to-all communication.

Definition

Expert parallelism (EP) is a distribution strategy tailored to Mixture-of-Experts (MoE) models. Each GPU (or group of GPUs) hosts a disjoint subset of the expert FFN modules. For every token, the router selects one or more experts; an all-to-all operation then sends each token to the GPU that hosts its selected expert, and a second all-to-all returns the expert outputs to the GPUs the tokens came from. EP depends on fast interconnects because any token may be routed to an expert on any GPU, so this exchange happens in every MoE layer. EP is typically combined with tensor parallelism for the attention layers. Load balancing across experts is a critical implementation concern: if the router favors a few experts, the GPUs hosting them become bottlenecks while the others sit idle.
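
The dispatch and combine steps described above can be sketched in plain Python as a single-process simulation. This is only an illustration under stated assumptions, not a real multi-GPU implementation: the "ranks" are list indices standing in for GPUs, the all-to-all is modeled as appending to per-rank inboxes, and the names `route`, `expert_ffn`, `owner_rank`, `dispatch`, `compute_and_combine`, and `expert_load` are hypothetical. The router is a toy top-1 rule (`token % NUM_EXPERTS`) standing in for a learned gating network, and the expert is a toy arithmetic function standing in for an FFN.

```python
from collections import defaultdict

NUM_EXPERTS = 4                              # assumption: 4 experts in the MoE layer
NUM_RANKS = 2                                # assumption: 2 simulated GPUs
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS  # contiguous block placement


def owner_rank(expert_id):
    """Rank that hosts a given expert (experts 0-1 on rank 0, 2-3 on rank 1)."""
    return expert_id // EXPERTS_PER_RANK


def route(token):
    """Toy top-1 router; a real router is a learned gating network."""
    return token % NUM_EXPERTS


def expert_ffn(expert_id, token):
    """Toy expert computation standing in for an FFN forward pass."""
    return token * 10 + expert_id


def dispatch(tokens_per_rank):
    """Simulated first all-to-all: send each token to the rank owning its expert.

    Each message carries (source rank, position, expert id, token) so the
    result can be returned to where the token came from.
    """
    inbox = defaultdict(list)
    for src, tokens in enumerate(tokens_per_rank):
        for pos, tok in enumerate(tokens):
            expert_id = route(tok)
            inbox[owner_rank(expert_id)].append((src, pos, expert_id, tok))
    return inbox


def compute_and_combine(inbox, tokens_per_rank):
    """Run the local experts on each rank, then simulate the second
    all-to-all by writing outputs back to the source rank and position."""
    outputs = [[None] * len(tokens) for tokens in tokens_per_rank]
    for dst in range(NUM_RANKS):
        for src, pos, expert_id, tok in inbox[dst]:
            outputs[src][pos] = expert_ffn(expert_id, tok)
    return outputs


def expert_load(inbox):
    """Number of tokens received by each expert, to expose load imbalance."""
    load = [0] * NUM_EXPERTS
    for messages in inbox.values():
        for _, _, expert_id, _ in messages:
            load[expert_id] += 1
    return load


if __name__ == "__main__":
    tokens_per_rank = [[0, 1, 2, 3], [4, 5, 6, 7]]  # integer stand-ins for token vectors
    inbox = dispatch(tokens_per_rank)
    print(compute_and_combine(inbox, tokens_per_rank))
    print(expert_load(inbox))
```

With the evenly spread input above, every expert receives the same number of tokens. Feeding a skewed batch such as `[[0, 4, 8, 1], [0, 0, 4, 2]]` sends most tokens to expert 0, so the rank hosting it does nearly all the work, which is the load-balancing concern in miniature. In a real system the inbox appends would be replaced by a collective all-to-all over the interconnect.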
