# Models: Transformer Architecture
*Quick reference card*
## Core Formulas
- **Scaled dot-product attention**: `Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V`
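The attention formula can be sketched in a few lines of NumPy (single head, no masking or batching, purely illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v) for a single head.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k)
    # Numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_q, d_v)
```

Each query row ends up holding a convex combination of the value rows; the `sqrt(d_k)` scaling keeps the logits from growing with head dimension.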
- **Model parameters (approx)**: `params ≈ 12 × n_layers × d_model²` (dominant term; ignores embeddings)
- **VRAM for weights (FP16)**: `VRAM_weights = params × 2 bytes`
- **KV cache per token**: `KV_bytes = 2 × n_layers × n_kv_heads × d_head × bytes_per_element` (the 2 counts K and V; `n_kv_heads = n_heads` for MHA, fewer for GQA/MQA)
- **Arithmetic intensity (decode)**: `AI = FLOPs / memory_bytes ≈ 1 FLOP/byte` at small batch, far below the ratio needed to keep a modern GPU compute-bound
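Putting the sizing formulas together for a hypothetical 7B-class config (the layer counts and dims below are illustrative, not an official model spec):

```python
def model_sizing(n_layers, d_model, n_heads, n_kv_heads, bytes_per_elem=2):
    """Back-of-envelope sizing from the formulas above (FP16 = 2 bytes)."""
    d_head = d_model // n_heads
    params = 12 * n_layers * d_model**2                        # dominant term only
    vram_weights = params * bytes_per_elem                     # bytes
    # K and V per token; GQA/MQA shrink this via n_kv_heads < n_heads
    kv_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem
    return params, vram_weights, kv_per_token

# Hypothetical 7B-class MHA config (illustrative numbers)
params, vram, kv = model_sizing(n_layers=32, d_model=4096, n_heads=32, n_kv_heads=32)
print(f"params ≈ {params / 1e9:.1f}B, "
      f"weights ≈ {vram / 2**30:.0f} GiB, "
      f"KV/token = {kv / 1024:.0f} KiB")
```

For this config the dominant term gives ≈ 6.4B parameters, ≈ 12 GiB of FP16 weights, and 512 KiB of KV cache per token, which is why long contexts at large batch sizes are cache-limited rather than weight-limited.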
## Prefill vs. Decode
| Property | Prefill | Decode |
|---|---|---|
| Parallelism | All tokens at once | One token at a time |
| Bottleneck | Compute (FLOP-bound) | Memory bandwidth (BW-bound) |
| Batch size effect | Linear throughput gain | Diminishing returns |
| KV cache | Written | Read + appended |
| Latency metric | TTFT (time to first token) | TPOT (time per output token) |
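A roofline-style sketch of why the two phases bottleneck differently, assuming a hypothetical accelerator (300 TFLOP/s FP16, 2 TB/s memory bandwidth), batch size 1, and ignoring KV-cache traffic:

```python
def latency_estimate(params, prompt_tokens, peak_flops, mem_bw, bytes_per_param=2):
    """Crude roofline estimate: prefill is FLOP-bound, decode is BW-bound.

    Each token costs ~2 FLOPs per parameter. During decode the entire weight
    set must be streamed from memory once per generated token, so TPOT is
    bounded by memory bandwidth, not compute.
    """
    ttft = (2 * params * prompt_tokens) / peak_flops   # seconds, compute-bound
    tpot = (params * bytes_per_param) / mem_bw         # seconds, bandwidth-bound
    return ttft, tpot

# Hypothetical hardware and 7B-class model (illustrative numbers)
ttft, tpot = latency_estimate(params=7e9, prompt_tokens=1024,
                              peak_flops=300e12, mem_bw=2e12)
print(f"TTFT ≈ {ttft * 1e3:.0f} ms, TPOT ≈ {tpot * 1e3:.1f} ms")
```

Under these assumptions TPOT is ~7 ms regardless of prompt length, which is also why batching helps decode: the same weight stream amortizes over more tokens.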
## Transformer Block Components
- **Multi-Head Attention (MHA)**
  - Full Q/K/V projections for every head; the most KV-cache-intensive variant. `n_heads × d_head = d_model`.
- **Grouped Query Attention (GQA)**
  - Groups of query heads share a smaller set of K/V heads, shrinking the KV cache. Used in Llama 3 and Mistral.
- **Multi-Query Attention (MQA)**
  - All query heads share a single K/V head: maximum KV-cache compression. Used in Falcon.
- **Feed-Forward Network (FFN)**
  - Two linear layers with a nonlinearity between them; hidden dim is typically 4× `d_model`. SwiGLU variants use three weight matrices.
- **RMSNorm**
  - Root-mean-square normalization; cheaper than LayerNorm (no mean subtraction or bias). Used in most modern LLMs.
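The per-token KV-cache formula makes the MHA/GQA/MQA trade-off concrete; the config below (32 layers, 32 query heads, `d_head = 128`, FP16) is illustrative:

```python
def kv_cache_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    # Factor of 2 accounts for storing both K and V
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

layers, d_head = 32, 128
for name, kv_heads in [("MHA (32 KV heads)", 32),
                       ("GQA (8 KV heads)", 8),
                       ("MQA (1 KV head)", 1)]:
    kib = kv_cache_per_token(layers, kv_heads, d_head) / 1024
    print(f"{name:20s} {kib:6.0f} KiB/token")
```

Only the K/V heads appear in the formula, so GQA with 8 KV heads cuts the cache 4×, and MQA cuts it 32×, with no change to the weight footprint of the query projections.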
## MoE vs Dense
- **Dense model**
  - All parameters are active for every token, so compute cost scales with total parameter count.
- **Mixture of Experts (MoE)**
  - A router selects the top-K experts per token, so active params << total params: higher VRAM footprint, lower FLOPs per token.
- **Expert routing**
  - Each token is routed to K of N experts (typically K = 2 with N = 8 or N = 64). Load balancing across experts is critical to avoid overloaded or under-trained experts.
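A minimal NumPy sketch of top-K expert selection (softmax over only the selected logits, Mixtral-style; real routers add load-balancing losses and capacity limits, which are omitted here):

```python
import numpy as np

def route_top_k(router_logits, k=2):
    """Select top-k experts per token and renormalize their gate weights.

    router_logits: (n_tokens, n_experts). Returns expert indices (n_tokens, k)
    and gate weights (n_tokens, k) that sum to 1 for each token.
    """
    # Indices of the k largest logits per token
    topk_idx = np.argsort(router_logits, axis=-1)[:, -k:]
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    # Softmax over only the selected experts
    exp = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates = exp / exp.sum(axis=-1, keepdims=True)
    return topk_idx, gates
```

The token's output is then the gate-weighted sum of the K selected experts' FFN outputs, so FLOPs scale with K × expert size rather than N × expert size.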