
Models: Transformer Architecture

4 sections · Quick reference card

Core Formulas

Scaled dot-product attention

Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
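The formula above can be sketched directly in NumPy (single head, unbatched; shapes here are illustrative, not from any particular model):

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_q, seq_k)
    # Numerically stable softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 4))   # 5 values
out = attention(Q, K, V)      # shape (3, 4)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with head dimension, which would otherwise saturate the softmax.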

Model parameters (approx)

params ≈ 12 × n_layers × d_model² (dominant term)

VRAM for weights (FP16)

VRAM_weights = params × 2 bytes

KV cache per token

KV_bytes = 2 × n_layers × d_head × n_heads × bytes_per_element

Arithmetic intensity (decode)

AI = FLOPs / memory_bytes ≈ 1 FLOP/byte at small batch (≈2 FLOPs per parameter vs. 2 bytes read per FP16 weight)
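Plugging numbers into the formulas above gives quick sizing estimates. The config below is illustrative (a 32-layer, d_model = 4096 MHA model), not any specific release:

```python
# Back-of-envelope sizing from the cheat-sheet formulas (FP16 throughout).
n_layers, d_model = 32, 4096
n_heads, d_head = 32, 128            # n_heads * d_head == d_model

params = 12 * n_layers * d_model**2                         # dominant term
vram_weights_gb = params * 2 / 1e9                          # 2 bytes/param
kv_bytes_per_token = 2 * n_layers * d_head * n_heads * 2    # K and V, FP16

print(f"params ≈ {params / 1e9:.1f} B")                      # ≈ 6.4 B
print(f"FP16 weights ≈ {vram_weights_gb:.1f} GB")            # ≈ 12.9 GB
print(f"KV cache ≈ {kv_bytes_per_token / 1024:.0f} KiB/token")  # 512 KiB
```

At 512 KiB per token, a 4096-token context alone consumes about 2 GB of KV cache, which is why the GQA/MQA variants below matter.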

Prefill vs. Decode

| Property | Prefill | Decode |
| --- | --- | --- |
| Parallelism | All tokens at once | One token at a time |
| Bottleneck | Compute (FLOP-bound) | Memory bandwidth (BW-bound) |
| Batch size effect | Linear throughput gain | Diminishing returns |
| KV cache | Written | Read + appended |
| Latency metric | TTFT (time to first token) | TPOT (time per output token) |
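A rough roofline estimate shows why the two phases hit different limits. The hardware numbers here are assumptions for illustration (real kernels reach only a fraction of peak):

```python
# Roofline sketch: compute-bound prefill vs. bandwidth-bound decode.
params = 7e9            # hypothetical 7B-parameter model
peak_flops = 300e12     # assumed accelerator peak, FLOP/s
mem_bw = 1.5e12         # assumed memory bandwidth, bytes/s

prompt_len = 1000
# Prefill: ~2 FLOPs per parameter per token, limited by compute.
ttft = 2 * params * prompt_len / peak_flops
# Decode (batch 1): every FP16 weight (2 bytes) is read per token,
# limited by memory bandwidth.
tpot = params * 2 / mem_bw

print(f"TTFT ≈ {ttft * 1e3:.0f} ms")    # ≈ 47 ms
print(f"TPOT ≈ {tpot * 1e3:.1f} ms")    # ≈ 9.3 ms
```

Batching decode requests amortizes those weight reads across tokens, which is how serving systems push decode back toward the compute-bound regime.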

Transformer Block Components

Multi-Head Attention (MHA)
Full Q/K/V for all heads. Most memory-intensive. n_heads × d_head = d_model.
Grouped Query Attention (GQA)
Multiple query heads share fewer K/V heads. Reduces KV cache size. Used in Llama 3, Mistral.
Multi-Query Attention (MQA)
All query heads share a single K/V head. Maximum KV compression. Used in Falcon.
Feed-Forward Network (FFN)
Two linear layers with a nonlinearity in between. Hidden dimension typically 4× d_model. SwiGLU uses 3 weight matrices (gate, up, down).
RMSNorm
Root Mean Square normalization. Cheaper than LayerNorm. Used in most modern LLMs.
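The three attention variants above differ only in the number of K/V heads, which scales the per-token KV cache directly. Layer and head counts below are illustrative:

```python
# KV cache per token for MHA vs. GQA vs. MQA: only n_kv_heads changes.
n_layers, d_head = 32, 128

def kv_bytes_per_token(n_kv_heads, bytes_per_element=2):  # FP16
    # Factor of 2 = one K and one V entry per layer per KV head.
    return 2 * n_layers * d_head * n_kv_heads * bytes_per_element

for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    print(f"{name}: {kv_bytes_per_token(n_kv) / 1024:.0f} KiB/token")
# MHA: 512 KiB, GQA: 128 KiB, MQA: 16 KiB per token
```

GQA with 8 KV heads cuts the cache 4× versus full MHA at this config, with little quality loss, which is why it has become the common default.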

MoE vs Dense

Dense model
All parameters active per token. Compute cost scales with total params.
Mixture of Experts (MoE)
Router selects top-K experts per token. Active params << total params. Higher VRAM, lower FLOPs.
Expert routing
Token routed to K of N experts (typically K=2, N=8 or N=64). Load balancing is critical.
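The active-vs-total gap can be estimated by counting parameters. This sketch assumes a simplified layout where only the FFN is replicated per expert and attention stays dense; all dimensions are hypothetical:

```python
# Active vs. total parameters for a hypothetical MoE stack (top-K routing).
n_layers, d_model = 32, 4096
attn_params = 4 * d_model**2               # Q, K, V, O projections (shared)
ffn_params_per_expert = 8 * d_model**2     # two d_model x 4*d_model matrices

N, K = 8, 2                                # experts per layer, experts per token
total = n_layers * (attn_params + N * ffn_params_per_expert)
active = n_layers * (attn_params + K * ffn_params_per_expert)

print(f"total ≈ {total / 1e9:.1f} B")      # ≈ 36.5 B (must fit in VRAM)
print(f"active ≈ {active / 1e9:.1f} B")    # ≈ 10.7 B (sets FLOPs per token)
```

This is the MoE trade in one number: VRAM scales with `total`, per-token compute with `active`.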