An inference engineer's goal is to create a balanced set of optimizations that delivers more than the sum of its parts. Sometimes techniques are symbiotic, sometimes incompatible — quantizing the KV cache helps disaggregation, but increasing batch size reduces compute available for speculation.
Quantization#
Quantization improves latency (both TTFT and TPS), increases throughput, and opens headroom for other optimizations. But when it goes wrong, it can materially reduce output quality.
Post-training quantization changes model weights from their native format (usually BF16 or FP16) to a lower-precision format:
- Prefill: Compute-bound prefill now runs on lower-precision Tensor Cores with 2x FLOPS
- Decode: Memory-bound decode now loads half as much data, effectively doubling bandwidth
In practice, dropping a single level of precision (e.g., FP16 → FP8) yields 30-50% better performance for LLMs.
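The decode-side claim is easy to sanity-check with back-of-envelope arithmetic: when decode is bound by weight streaming, halving bytes per parameter halves the time floor per token. A minimal sketch (the 70B parameter count and ~3,350 GB/s H100-class bandwidth are illustrative assumptions):

```python
def decode_step_ms(param_count: float, bytes_per_param: float,
                   bandwidth_gbps: float) -> float:
    """Time to stream all weights once — the floor for one memory-bound decode step."""
    bytes_total = param_count * bytes_per_param
    return bytes_total / (bandwidth_gbps * 1e9) * 1e3

params = 70e9      # 70B-parameter dense model (assumed)
hbm_bw = 3350      # GB/s, roughly H100 SXM HBM3 (assumed)

bf16 = decode_step_ms(params, 2.0, hbm_bw)   # 2 bytes/param
fp8 = decode_step_ms(params, 1.0, hbm_bw)    # 1 byte/param
print(f"BF16: {bf16:.1f} ms/token, FP8: {fp8:.1f} ms/token "
      f"({bf16 / fp8:.1f}x faster)")
```

Real speedups land below this 2x ceiling (hence the 30-50% figure) because activations, KV cache reads, and kernel overheads don't shrink with the weights.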
Number Formats
| Name | Bits | First Architecture | Use |
|---|---|---|---|
| FP16 | 16 | Pascal (2016) | Default inference |
| BF16 | 16 | Ampere (2020) | Training & inference |
| FP8 | 8 | Hopper (2022) | Sweet spot for quality/speed |
| MXFP8 | 8 | Blackwell (2024) | Microscaling for better accuracy |
| FP4 | 4 | Blackwell (2024) | Aggressive quantization |
| NVFP4 | 4 | Blackwell (2024) | Best FP4 accuracy (block size 16) |
Floating-point formats have sign, exponent, and mantissa bits — the exponent gives higher dynamic range than integers, better representing outliers after quantization.
Granularity matters: scale factors can be assigned per tensor (one scale for everything), per channel, or per block. Finer granularity lowers the risk of quality loss but adds scale-factor storage and compute overhead.
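The granularity tradeoff is easiest to see with a single outlier weight. A sketch using simple symmetric absmax scaling toward the FP8 E4M3 maximum of 448 (the block size of 32 and the absence of outlier clipping are assumptions):

```python
import numpy as np

FP8_MAX = 448.0  # largest representable value in FP8 E4M3

def quantize_per_tensor(w: np.ndarray):
    scale = np.abs(w).max() / FP8_MAX            # one scale for the whole tensor
    return w / scale, scale

def quantize_per_block(w: np.ndarray, block: int = 32):
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / FP8_MAX  # one scale per block
    return w / scales, scales

w = np.random.randn(4096).astype(np.float32)
w[0] = 50.0                                      # a single large outlier
_, s_tensor = quantize_per_tensor(w)
_, s_blocks = quantize_per_block(w)
# Per-tensor: the outlier inflates the one global scale, crushing small
# weights toward zero. Per-block: only the outlier's 32-element block pays;
# the other 127 blocks keep their full dynamic range.
```

This is exactly why the microscaling formats (MXFP8, NVFP4 with block size 16) exist: small blocks contain the damage from outliers.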
Quantization Approaches
Components vary in sensitivity — from least to most:
- Weights (linear layers) — least sensitive
- Activations — somewhat sensitive
- KV cache — moderately sensitive (errors compound token-to-token)
- Attention (especially softmax) — highly sensitive
Key Takeaway
FP8/MXFP8 is the sweet spot for production inference. Quantize weights and activations to FP8, quantize the KV cache carefully, and leave attention in its original precision. If you can't tolerate any quality loss, skip quantization — every other technique in this chapter is lossless.
Try it: Quantization Quality Estimator →
Explore precision tradeoffs: memory savings, speedup, and quality risk.
Measuring Quality Impact
Three methods: perplexity (simplest), intelligence benchmarks (MMLU, SWE-bench), and custom evals (best). In every case, you're looking for differences indistinguishable from noise.
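Perplexity is just the exponentiated average negative log-likelihood of the model over a held-out corpus. A minimal sketch (the log-probability values are illustrative, not from a real model):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

baseline = perplexity([-2.1, -0.3, -1.7, -0.9])   # original-precision model
quantized = perplexity([-2.2, -0.3, -1.8, -0.9])  # same text, quantized model
print(baseline, quantized)
```

Run both models over the same corpus and compare: a quantized model whose perplexity delta sits within run-to-run noise passes this check, though perplexity alone can miss task-specific regressions — hence the benchmarks and custom evals.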
Speculative Decoding#
Speculative decoding exploits spare compute during memory-bound decode to generate multiple tokens per forward pass. Only improves TPS/ITL, not TTFT.
The mechanism:
- A speculator generates draft tokens
- The target model validates drafts in a single forward pass
- Accepted tokens + one new token = N+1 tokens per pass
Performance depends on: draft token cost, draft sequence length, and token acceptance rate.
Try it: Speculative Decoding Simulator →
Adjust draft length, acceptance rate, and overhead to see effective TPS.
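The arithmetic behind those three factors can be sketched under the standard simplification that each draft token is accepted independently with probability `a` (real acceptance rates are position-dependent; the function names are assumptions):

```python
def expected_tokens(a: float, k: int) -> float:
    """Expected tokens per target forward pass with draft length k and
    per-token acceptance rate a: accepted prefix + 1 bonus token
    = sum_{i=0}^{k} a^i."""
    return (1 - a ** (k + 1)) / (1 - a)

def effective_tps(base_tps: float, a: float, k: int, overhead: float) -> float:
    """base_tps: target-only decode speed; overhead: drafting cost relative
    to one target forward pass."""
    return base_tps * expected_tokens(a, k) / (1 + overhead)

print(expected_tokens(0.8, 4))   # ≈ 3.36 tokens per target pass
```

Note the diminishing returns: with acceptance rate 0.8, going from 4 to 8 draft tokens adds barely one expected token, while drafting cost keeps growing — which is why draft length is worth tuning rather than maximizing.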
Draft-Target Speculation
Uses a small draft model (typically ~10x smaller than the target, usually from the same family). Quick to set up but has the highest overhead — you must store the draft model's weights, activations, and KV cache alongside the target's.
Medusa
Fine-tunes the target model to add 2-4 extra decoder heads that generate draft tokens. No separate model needed, but limited acceptance rate.
EAGLE
Purpose-built draft model trained to consume hidden states from the target model and generate up to 8 draft tokens with high acceptance rate. The go-to algorithm for general use.
N-gram Speculation
Constructs an n-gram dictionary during prefill. Sequences can exceed 10 tokens. Mostly used for code completion where syntax is predictable and output closely matches input.
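A minimal sketch of the idea: index the prompt's n-grams during prefill, then propose as drafts whatever followed the current n-gram last time it appeared (the bigram key, function names, and draft length cap are assumptions):

```python
from collections import defaultdict

def build_index(tokens: list[int], n: int = 2) -> dict:
    """Map each n-gram in the context to the positions right after it."""
    index = defaultdict(list)
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])].append(i + n)
    return index

def propose(tokens: list[int], index: dict, n: int = 2, max_draft: int = 8):
    """Draft tokens by copying the continuation of the most recent match."""
    key = tuple(tokens[-n:])
    if key not in index:
        return []                              # no match: fall back to plain decode
    start = index[key][-1]
    return tokens[start:start + max_draft]

toks = [5, 9, 2, 7, 1, 5, 9]                   # last bigram (5, 9) also seen earlier
print(propose(toks, build_index(toks)))        # → [2, 7, 1, 5, 9]
```

Because drafting is a dictionary lookup, the overhead is near zero — the technique wins whenever the output repeats the input, as in code edits and retrieval-heavy prompts.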
Caching#
Try it: KV Cache Sizing Calculator →
Calculate KV cache memory for different models, contexts, and precisions.
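The calculator's core formula is simple: per request, the cache holds a key and a value vector for every layer, KV head, and token. A sketch with a Llama-3-70B-like shape (80 layers, 8 GQA KV heads, head dimension 128 — assumed for illustration):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int) -> float:
    """KV cache footprint for one request: 2 (K and V) x layers x KV heads
    x head_dim x tokens x element size."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# 8k-token context:
print(kv_cache_gb(80, 8, 128, 8192, 2))   # BF16: ≈ 2.68 GB per request
print(kv_cache_gb(80, 8, 128, 8192, 1))   # FP8:  ≈ 1.34 GB per request
```

At these sizes, a few dozen concurrent long-context requests exhaust an 80 GB GPU — which is why KV cache precision, eviction, and offload policies matter so much.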
Prefix Caching and KV Cache Re-Use
Many requests share common prefixes — system prompts, few-shot examples, previous conversation turns. Prefix caching stores and reuses KV cache for these shared prefixes, skipping redundant prefill computation.
Prefix caching is especially effective for:
- Multi-turn chat: Reuse KV from prior turns
- Code completion: Reuse KV from file context
- Agents: Reuse KV from system prompt and tools
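One common implementation strategy is block-level hashing: hash the token sequence in fixed-size blocks, chaining each block's hash with its predecessor so that identical prefixes always produce identical keys. A sketch (the block size of 16 and the chaining scheme are assumptions):

```python
import hashlib

BLOCK = 16  # tokens per cached KV block (assumed)

def block_keys(tokens: list[int]) -> list[str]:
    """Chained hashes: a block's key depends on its contents and its prefix."""
    keys, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        keys.append(h)
        prev = h
    return keys

def cached_prefix_len(tokens: list[int], cache: set[str]) -> int:
    """How many leading tokens already have KV blocks in the cache."""
    hit = 0
    for k in block_keys(tokens):
        if k not in cache:
            break
        hit += BLOCK
    return hit                                 # prefill only runs on tokens[hit:]

system_prompt = list(range(64))
cache = set(block_keys(system_prompt))         # populated by an earlier request
request = system_prompt + [999, 1000, 1001]    # same prefix, new user turn
print(cached_prefix_len(request, cache))       # → 64: prefill skips the shared prefix
```

The chaining is what makes partial matches safe: a block's key only matches if every block before it matched too, so reuse never mixes KV state from different prefixes.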
Where to Store the KV Cache
By default, the KV cache lives in GPU VRAM. Options for overflow:
- CPU memory (via Grace NVLink C2C for fast access)
- Distributed KV stores (across replicas)
- Disk (last resort)
Cache-Aware Routing
Route requests to replicas that already hold matching prefixes. Higher cache hit rates yield lower TTFT.
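A routing policy along these lines can be sketched in a few lines: prefer the replica with the longest cached prefix for the request, breaking ties toward the least-loaded one (replica names and the precomputed hit counts are illustrative; a real router would derive them from KV block hashes):

```python
def pick_replica(prefix_hits: dict[str, int], load: dict[str, int]) -> str:
    """Maximize cached prefix tokens; break ties by fewest in-flight requests."""
    return max(prefix_hits, key=lambda r: (prefix_hits[r], -load[r]))

hits = {"replica-a": 0, "replica-b": 4096, "replica-c": 1024}
load = {"replica-a": 2, "replica-b": 9, "replica-c": 3}
print(pick_replica(hits, load))   # → replica-b: most prefix tokens already cached
```

In practice the policy also needs a load cap, otherwise popular prefixes concentrate all traffic on one replica and queueing delay erases the TTFT win.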
Model Parallelism#
When a model is too large for one GPU, split it across multiple GPUs:
| Strategy | Mechanism | Best For |
|---|---|---|
| Tensor Parallelism (TP) | Split individual layers across GPUs | Lower latency (requires NVLink) |
| Expert Parallelism (EP) | Shard MoE experts across GPUs | Higher throughput for MoE models |
| Pipeline Parallelism (PP) | Split layers into stages | Multi-node (but introduces bubbles) |
Key Takeaway
Use Tensor Parallelism within a node (where NVLink provides high bandwidth). For MoE models, Expert Parallelism enables each GPU to hold multiple full experts. Pipeline Parallelism is a last resort for multi-node setups.
Disaggregation#
Disaggregation separates prefill and decode onto independently scaling hardware:
- Prefill workers: Optimized for compute (high FLOPS)
- Decode workers: Optimized for memory bandwidth
The KV cache computed during prefill must be transferred to decode workers. KV cache quantization alleviates this transfer bottleneck.
Use disaggregation when:
- Prefill and decode have very different resource needs
- Traffic patterns vary (some requests are prefill-heavy, others decode-heavy)
- You need independent scaling for cost efficiency
NVIDIA Dynamo provides dynamic disaggregation, automatically routing between prefill and decode workers based on real-time load.