Mixture of Experts (MoE)
Architecture where each token is routed to a sparse subset of specialist feed-forward layers, enabling large parameter counts at low active-parameter cost.
Definition
In a Mixture-of-Experts (MoE) model, some or all Transformer layers replace the single dense feed-forward network (FFN) with E expert FFN modules and a routing network. For each token, the router scores the experts and activates only the top K (K = 2 in Mixtral; other models use different values), so only a fraction of the model's parameters participate in each forward pass. This allows models like Mixtral and DeepSeek to have tens or hundreds of billions of total parameters while using per-token compute comparable to a much smaller dense model. Inference raises two main challenges: routing can load experts unevenly, and expert parallelism across GPUs requires additional all-to-all communication to send each token to the devices holding its selected experts and to return the results.
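The routing step can be illustrated with a minimal pure-Python sketch for a single token. The names top_k_route and moe_layer, the toy scaling "experts", and the choice to apply the softmax over only the selected logits are assumptions made for illustration; real models differ in how they normalize gate weights and run this batched on accelerators.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, gate_weight) pairs whose weights sum to 1.
    Assumption: softmax is taken over the selected logits only.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the max before exponentiating for numerical stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, router_logits, experts, k=2):
    """Gate-weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](token)  # unselected experts are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Hypothetical toy experts: each scales the token by a fixed factor,
# standing in for a full feed-forward network.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 2.0, -1.0, 1.0]      # one router score per expert
routes = top_k_route(logits, k=2)   # experts 1 and 3 are selected
output = moe_layer([1.0, 1.0], logits, experts, k=2)
```

With E = 4 and K = 2, only two of the four experts run for this token, which is the source of the gap between total and active parameters. Counting how often each expert index appears across many tokens is a simple way to observe the load imbalance mentioned above.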
Related
More Architecture terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.