Architecture

Mixture of Experts (MoE)

An architecture in which each token is routed to a small subset of expert feed-forward networks, allowing a large total parameter count at a low active-parameter cost per token.

Definition

In a Mixture-of-Experts (MoE) model, some or all Transformer layers replace the single dense feed-forward network (FFN) with E expert FFN modules and a routing network (the router). For each token, the router scores the experts and activates only the top-K of them (2 in Mixtral; other models use different values), and the layer output is the router-weighted sum of the selected experts' outputs. Only a fraction of the FFN parameters is therefore used per token. This allows models like Mixtral and DeepSeek to have tens to hundreds of billions of total parameters while using per-token compute comparable to a much smaller dense model. At inference, the main challenges are that routing can load experts unevenly, and that expert parallelism across GPUs requires additional all-to-all communication to send tokens to their experts and return the results.
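
The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular model: the function name `moe_layer`, the two-matrix ReLU experts, and the choice to renormalize gate weights with a softmax over only the selected experts are all assumptions made for the example. It also returns a per-expert token count, which makes the load-imbalance issue visible.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, experts_w1, experts_w2, top_k=2):
    """Hypothetical top-K MoE feed-forward layer (illustrative only).

    tokens:     (T, D) token activations
    router_w:   (D, E) router weights, one score column per expert
    experts_w1: (E, D, H) first matrix of each expert FFN
    experts_w2: (E, H, D) second matrix of each expert FFN
    Returns the (T, D) output and the number of tokens sent to each expert.
    """
    num_experts = router_w.shape[1]
    logits = tokens @ router_w                            # (T, E) router scores
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]     # (T, K) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                  # assumption: renormalize over the K chosen

    out = np.zeros_like(tokens)
    load = np.zeros(num_experts, dtype=int)
    for e in range(num_experts):
        # Rows (tokens) that picked expert e, and which of their K slots it fills.
        token_ids, slot = np.nonzero(top_idx == e)
        load[e] = token_ids.size
        if token_ids.size == 0:
            continue                                      # idle expert: no compute spent
        hidden = np.maximum(tokens[token_ids] @ experts_w1[e], 0.0)  # ReLU FFN
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ experts_w2[e])
    return out, load

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, H, E = 16, 8, 32, 4
    x = rng.normal(size=(T, D))
    out, load = moe_layer(
        x,
        rng.normal(size=(D, E)),
        rng.normal(size=(E, D, H)),
        rng.normal(size=(E, H, D)),
        top_k=2,
    )
    print("tokens per expert:", load)  # uneven counts indicate load imbalance
```

Each token runs through only `top_k` of the `E` experts, so the FFN compute per token scales with K rather than E. In an expert-parallel deployment, the per-expert gather and scatter in the loop correspond to the all-to-all exchanges between GPUs.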
