Architecture

Mixture of Experts (MoE)

Architecture where each token is routed to a sparse subset of specialist feed-forward layers, enabling large parameter counts at low active-parameter cost.

Definition

In a Mixture-of-Experts (MoE) model, some or all Transformer layers replace the single dense feed-forward network (FFN) with E expert FFN modules and a small routing network. For each token, the router scores the experts and activates only the top K (commonly a small number such as 2, as in Mixtral), so only a fraction of the model's parameters is used per token. This lets models like Mixtral and DeepSeek hold tens to hundreds of billions of total parameters while spending per-token compute comparable to a much smaller dense model. At inference time there are two main challenges: routing can load experts unevenly, and expert parallelism across GPUs requires additional all-to-all communication to send each token to the devices hosting its selected experts and to gather the results.
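
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular model: the function names (`route_top_k`, `moe_layer`, `expert_load`), the toy experts, and the choice to apply softmax over only the selected logits are all assumptions made for the example. It shows top-K selection, the gate-weighted combination of expert outputs, and a per-expert token count that makes load imbalance visible.

```python
import math
import random
from collections import Counter


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for one token.

    Picks the k experts with the highest router logits and applies softmax
    over just those k logits, so the gate weights sum to 1 (an assumed
    convention; other gating schemes exist).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))


def moe_layer(token, router_logits, experts, k=2):
    """Gate-weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(router_logits, k):
        y = experts[idx](token)  # unselected experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out


def expert_load(all_logits, num_experts, k=2):
    """Count how many tokens each expert receives across a batch."""
    counts = Counter()
    for logits in all_logits:
        for idx, _ in route_top_k(logits, k):
            counts[idx] += 1
    return [counts[e] for e in range(num_experts)]


if __name__ == "__main__":
    E, K = 4, 2
    # Hypothetical toy experts: expert e just scales its input by (e + 1).
    experts = [lambda x, s=e + 1: [s * v for v in x] for e in range(E)]
    rng = random.Random(0)
    batch_logits = [[rng.gauss(0, 1) for _ in range(E)] for _ in range(8)]
    print(moe_layer([1.0, 2.0], batch_logits[0], experts, K))
    print(expert_load(batch_logits, E, K))  # uneven counts indicate imbalance
```

With E = 4 and K = 2, each token runs through two of the four experts, so roughly half of the expert parameters are touched per token. The counts from `expert_load` always sum to K times the number of tokens, but nothing forces them to be equal across experts, which is the load-imbalance problem noted above.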

Related: Expert Parallelism, Tensor Parallelism, Chapter 2: Architecture

More Architecture terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
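
To make the KV cache saving from GQA concrete, here is a rough size calculation. It is a back-of-the-envelope sketch: the helper name `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, 4096-token context, 16-bit values) are illustrative assumptions, not the configuration of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing one K tensor and one
    V tensor per layer. bytes_per_value=2 assumes 16-bit values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical shape: 32 layers, head_dim 128, 4096-token context.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(mha / 2**30)  # MHA: one K/V head per query head, size in GiB
print(gqa / 2**30)  # GQA: 8 K/V heads shared by 32 query heads, size in GiB
print(mha // gqa)   # reduction factor
```

Under these assumptions the cache shrinks in direct proportion to the number of K/V heads: sharing each K/V head across four query heads cuts the cache to a quarter of the MHA size.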
