
Techniques: Optimization Deep Dives

4 sections · Quick reference card

Quantization Formats

| Format | Bits | VRAM vs FP16 | Typical Quality Loss | Notes |
|---|---|---|---|---|
| FP32 | 32 | 2× | None (baseline) | Training only |
| FP16 / BF16 | 16 | 1× | None | Standard inference |
| FP8 | 8 | 0.5× | < 1% | H100/H200 native, NVIDIA Hopper+ |
| INT8 (W8A8) | 8 | 0.5× | < 1% | SmoothQuant, LLM.int8() |
| INT4 (W4A16) | 4 | 0.25× | 1–3% | GPTQ, AWQ, common for LLMs |
| INT4 (W4A8) | 4/8 | 0.25× | 1–3% | Higher-throughput variant |
| 2-bit | 2 | 0.125× | 5–10%+ | QuIP#, AQLM (aggressive) |
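The VRAM column follows directly from bits-per-parameter. A minimal sketch of the arithmetic (the function name and the 70B parameter count are illustrative, not from any library):

```python
def weight_vram_gb(n_params: float, bits: int) -> float:
    # bytes = params × bits / 8; report in decimal GB (1 GB = 1e9 bytes)
    return n_params * bits / 8 / 1e9

# A 70B-parameter model at different precisions:
print(weight_vram_gb(70e9, 16))  # FP16  → 140.0 GB
print(weight_vram_gb(70e9, 4))   # INT4  → 35.0 GB, i.e. 0.25× of FP16
```

Note this counts weights only; activations and KV cache (next section) come on top.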

KV Cache Formulas

KV cache per token (bytes)

2 × n_layers × (n_kv_heads × d_head) × dtype_bytes

Total KV cache (bytes)

kv_per_token × max_seq_len × batch_size

KV cache for Llama-3 70B (FP16)

2 × 80 × (8 × 128) × 2 = 327,680 bytes/token ≈ 320 KB/token

KV cache compression (GQA ratio)

kv_size_reduction = n_kv_heads / n_query_heads
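The formulas above can be checked numerically. A minimal sketch using Llama-3 70B's published shapes (80 layers, 8 KV heads, head dim 128, FP16); the function names and the 8K/batch-4 serving config are illustrative assumptions:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       dtype_bytes: int) -> int:
    # Factor of 2 covers both keys and values
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

def kv_cache_total(per_token: int, max_seq_len: int, batch_size: int) -> int:
    return per_token * max_seq_len * batch_size

# Llama-3 70B in FP16 (2 bytes/value)
per_tok = kv_bytes_per_token(n_layers=80, n_kv_heads=8, d_head=128, dtype_bytes=2)
print(per_tok)                           # 327680 bytes ≈ 320 KB/token

# Hypothetical serving config: 8192-token context, batch of 4
total = kv_cache_total(per_tok, max_seq_len=8192, batch_size=4)
print(total / 2**30)                     # ≈ 10 GiB of KV cache
```

With 64 query heads, the GQA ratio is 8/64 = 1/8, so this is already 8× smaller than full multi-head attention would require.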

Parallelism Types

| Type | What's Split | When to Use | Comm Overhead |
|---|---|---|---|
| Tensor (TP) | Weight matrices across GPUs | Model too large for 1 GPU | High (all-reduce each layer) |
| Pipeline (PP) | Layers across GPUs | Very deep models, large batch | Medium (bubble overhead) |
| Data (DP) | Batch across GPU replicas | High throughput, model fits 1 GPU | Low (gradient sync only) |
| Sequence (SP) | Sequence length across GPUs | Very long context | Medium (ring attention) |
| Expert (EP) | MoE experts across GPUs | Large MoE models | Medium (all-to-all routing) |
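The tensor-parallel row can be made concrete with a toy sketch: split a weight matrix row-wise across two "GPUs" (here plain NumPy arrays), compute partial products, then sum them — the sum is exactly the per-layer all-reduce the table refers to. Shapes and names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: batch × d_in
W = rng.standard_normal((8, 6))   # weights:     d_in × d_out

# Row-parallel split: each device holds half of W's input dim
# and the matching slice of X, and computes a partial product.
partial0 = X[:, :4] @ W[:4, :]    # computed on "GPU 0"
partial1 = X[:, 4:] @ W[4:, :]    # computed on "GPU 1"

# The all-reduce step: summing partials recovers the full output.
Y = partial0 + partial1
assert np.allclose(Y, X @ W)
```

This is why TP's communication cost is high: that sum must happen across devices on every layer, every forward pass.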

Speculative Decoding

Draft model
Small fast model generates K candidate tokens speculatively (e.g., 4–8 tokens).
Target model
Large model verifies all K drafts in a single parallel forward pass.
Acceptance rate (α)
Fraction of draft tokens accepted. Higher α = better speedup. Depends on draft/target similarity.
Speedup
Expected tokens per step = (1 - α^(K+1)) / (1 - α). Up to 2–3× at high acceptance rates.
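The speedup formula is a truncated geometric series and is easy to evaluate. A minimal sketch (function name is illustrative):

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected accepted tokens per target-model verification pass,
    # given per-token acceptance rate alpha and K draft tokens:
    # (1 - alpha^(K+1)) / (1 - alpha)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# e.g. K = 4 drafts at 80% acceptance → ~3.36 tokens per target pass
print(expected_tokens_per_step(0.8, 4))
```

Net wall-clock speedup is lower than this ratio, since the draft model's own forward passes are not free.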