Techniques: Optimization Deep Dives
4 sections · Quick reference card
Quantization Formats
| Format | Bits | VRAM vs FP16 | Typical Quality Loss | Notes |
|---|---|---|---|---|
| FP32 | 32 | 2× | None (baseline) | Training only |
| FP16 / BF16 | 16 | 1× | None | Standard inference |
| FP8 | 8 | 0.5× | < 1% | H100/H200 native, NVIDIA Hopper+ |
| INT8 (W8A8) | 8 | 0.5× | < 1% | SmoothQuant, LLM.int8() |
| INT4 (W4A16) | 4 | 0.25× | 1–3% | GPTQ, AWQ, common for LLMs |
| INT4 (W4A8) | 4/8 | 0.25× | 1–3% | Higher throughput variant |
| 2-bit | 2 | 0.125× | 5–10%+ | QuIP#, AQLM — aggressive |
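The VRAM column above can be turned into a quick sizing estimate. A minimal sketch, counting weight memory only (real quantized checkpoints add some overhead for scales/zero-points, and KV cache and activations are extra):

```python
def weight_vram_gb(n_params: float, bits: int) -> float:
    """Approximate VRAM for model weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3

# Illustrative: a 70B-parameter model at the precisions in the table
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_vram_gb(70e9, bits):.1f} GB")
```

This is why INT4 is the common choice for fitting large LLMs on single consumer or workstation GPUs: the weight footprint drops to a quarter of FP16.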
KV Cache Formulas
KV cache per token (bytes): 2 × n_layers × (n_kv_heads × d_head) × dtype_bytes
Total KV cache (bytes): kv_per_token × max_seq_len × batch_size
KV cache for Llama-3 70B (FP16): 2 × 80 × (8 × 128) × 2 = 327,680 bytes/token ≈ 320 KB/token
KV cache compression (GQA ratio): kv_size_reduction = n_kv_heads / n_query_heads
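The formulas above translate directly into code. A minimal sketch, using the Llama-3 70B configuration from the worked example (80 layers, 8 KV heads via GQA, head dim 128, FP16):

```python
def kv_cache_per_token_bytes(n_layers: int, n_kv_heads: int,
                             d_head: int, dtype_bytes: int = 2) -> int:
    # Leading factor of 2 covers the separate K and V tensors
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

def kv_cache_total_bytes(kv_per_token: int, max_seq_len: int,
                         batch_size: int) -> int:
    return kv_per_token * max_seq_len * batch_size

per_tok = kv_cache_per_token_bytes(80, 8, 128, 2)
print(per_tok)  # 327680 bytes/token, i.e. 320 KB/token
print(kv_cache_total_bytes(per_tok, max_seq_len=8192, batch_size=1) / 1024**3)
```

At an 8K context and batch size 1, that works out to about 2.5 GiB of KV cache on top of the weights; GQA (8 KV heads vs. 64 query heads) is what keeps it this small.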
Parallelism Types
| Type | What's Split | When to Use | Comm Overhead |
|---|---|---|---|
| Tensor (TP) | Weight matrices across GPUs | Model too large for 1 GPU | High (all-reduce each layer) |
| Pipeline (PP) | Layers across GPUs | Very deep models, large batch | Medium (bubble overhead) |
| Data (DP) | Batch across GPU replicas | High throughput, model fits 1 GPU | Low (gradient sync only) |
| Sequence (SP) | Sequence length across GPUs | Very long context | Medium (ring attention) |
| Expert (EP) | MoE experts across GPUs | Large MoE models | Medium (all-to-all routing) |
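A rough sizing rule follows from the table: TP and PP both shard the weights, while DP replicates them. A minimal sketch of per-GPU weight memory under a chosen layout (weight memory only; optimizer state, activations, and KV cache are extra):

```python
def weights_per_gpu_gb(n_params: float, bytes_per_param: int,
                       tp: int = 1, pp: int = 1) -> float:
    """Per-GPU weight footprint in GiB. DP degree is omitted
    because data parallelism replicates rather than shards weights."""
    return n_params * bytes_per_param / (tp * pp) / 1024**3

# e.g. a 70B FP16 model with TP=8 on one node
print(f"{weights_per_gpu_gb(70e9, 2, tp=8):.1f} GB/GPU")
```

The all-reduce cost of TP is why it is usually kept within a node (fast NVLink), with PP or DP used across nodes.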
Speculative Decoding
- Draft model: small, fast model generates K candidate tokens speculatively (e.g., 4–8 tokens).
- Target model: large model verifies all K drafts in a single parallel forward pass.
- Acceptance rate (α): fraction of draft tokens accepted. Higher α = better speedup; depends on draft/target similarity.
- Speedup: expected tokens per step = (1 - α^(K+1)) / (1 - α). Up to 2–3× at high acceptance rates.