Chapter 5

Techniques

Optimization Deep Dives

40 min read · 5 sections

An inference engineer's goal is to create a balanced set of optimizations that delivers more than the sum of its parts. Sometimes techniques are symbiotic, sometimes incompatible — quantizing the KV cache helps disaggregation, but increasing batch size reduces compute available for speculation.

Quantization

Quantization improves latency (both TTFT and TPS), increases throughput, and opens headroom for other optimizations. But when it goes wrong, it can materially reduce output quality.

Post-training quantization changes model weights from their native format (usually BF16 or FP16) to a lower-precision format:

  • Prefill: Compute-bound prefill now runs on lower-precision Tensor Cores with 2x FLOPS
  • Decode: Memory-bound decode now loads half as much data, effectively doubling bandwidth

In practice, dropping a single level of precision (e.g., BF16 → FP8) gives 30-50% better performance for LLMs.

Number Formats

| Name  | Bits | First Architecture | Use                              |
|-------|------|--------------------|----------------------------------|
| FP16  | 16   | Pascal (2016)      | Default inference                |
| BF16  | 16   | Ampere (2020)      | Training & inference             |
| FP8   | 8    | Hopper (2022)      | Sweet spot for quality/speed     |
| MXFP8 | 8    | Blackwell (2024)   | Microscaling for better accuracy |
| FP4   | 4    | Blackwell (2024)   | Aggressive quantization          |
| NVFP4 | 4    | Blackwell (2024)   | Best FP4 accuracy (block size 16)|

Floating-point formats have sign, exponent, and mantissa bits — the exponent gives higher dynamic range than integers, better representing outliers after quantization.

Granularity matters: tensor-level (one scale factor), channel-level, or block-level quantization. More granular = lower risk of quality loss but more overhead.
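As a toy sketch of why granularity matters (using NumPy and int8-style rounding to stand in for real FP8 kernels, with made-up weight values), compare one scale for the whole tensor against one scale per block when a single outlier is present:

```python
import numpy as np

def quantize_dequantize(x, scale, qmax=127):
    """Round-trip x through a low-precision grid defined by `scale`."""
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4, 256))  # hypothetical weight tensor
w[0, 0] = 1.0                           # a single outlier weight

qmax = 127
# Tensor-level: one scale for everything; the outlier inflates it
tensor_scale = np.abs(w).max() / qmax
err_tensor = np.abs(w - quantize_dequantize(w, tensor_scale, qmax)).mean()

# Block-level: one scale per row; the outlier only hurts its own block
block_scales = np.abs(w).max(axis=1, keepdims=True) / qmax
err_block = np.abs(w - quantize_dequantize(w, block_scales, qmax)).mean()
```

Here `err_block` comes out well below `err_tensor`: the finer scales preserve the small-magnitude weights that a single tensor-wide scale would flatten.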

Quantization Approaches

Components vary in sensitivity — from least to most:

  1. Weights (linear layers) — least sensitive
  2. Activations — somewhat sensitive
  3. KV cache — moderately sensitive (errors compound token-to-token)
  4. Attention (especially softmax) — highly sensitive

Key Takeaway

FP8/MXFP8 is the sweet spot for production inference. Quantize weights and activations with FP8, carefully quantize the KV cache, and leave attention in its original precision. If you can't risk any quality loss, skip quantization: every other technique in this chapter is lossless.

Try it: Quantization Quality Estimator

Explore precision tradeoffs: memory savings, speedup, and quality risk.

Measuring Quality Impact

Three methods: perplexity (simplest), intelligence benchmarks (MMLU, SWE-bench), and custom evals (best). In every case, you're looking for differences indistinguishable from noise.
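Perplexity is the exponentiated mean negative log-likelihood of held-out tokens. A minimal sketch, with hypothetical per-token log-probabilities standing in for real model outputs:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over held-out tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs from baseline and quantized models
baseline = perplexity([-1.2, -0.8, -2.1, -0.5])
quantized = perplexity([-1.25, -0.79, -2.15, -0.52])
```

Run the same held-out text through both models; a perplexity gap within run-to-run noise is what "indistinguishable" looks like in practice.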

Speculative Decoding

Speculative decoding exploits spare compute during memory-bound decode to generate multiple tokens per forward pass. It improves only TPS/ITL, not TTFT.

The mechanism:

  1. A speculator generates draft tokens
  2. The target model validates drafts in a single forward pass
  3. If N draft tokens are accepted, the pass yields N+1 tokens (the accepted drafts plus one token from the target's own prediction)

Performance depends on: draft token cost, draft sequence length, and token acceptance rate.
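Under a simplifying assumption that each draft token is accepted independently with probability `p` and verification stops at the first rejection, the expected tokens per target forward pass for draft length `k` can be sketched as:

```python
def tokens_per_target_pass(k, p):
    """Expected tokens per target pass: drafts accepted before the first
    rejection, plus the one token the target pass always produces.
    Closed form: sum_{i=0}^{k} p^i = (1 - p^(k+1)) / (1 - p)."""
    if p == 1.0:
        return k + 1
    return (1 - p ** (k + 1)) / (1 - p)
```

For example, 4 drafts at an 80% acceptance rate average about 3.36 tokens per pass; the net speedup is then discounted by the draft tokens' generation cost.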

Try it: Speculative Decoding Simulator

Adjust draft length, acceptance rate, and overhead to see effective TPS.

Draft-Target Speculation

Uses a small draft model (often ~10x smaller than the target, usually from the same family). Quick to set up, but highest overhead — it must store the draft model's weights, activations, and KV cache.

Medusa

Fine-tunes the target model to add 2-4 extra decoder heads that generate draft tokens. No separate model needed, but limited acceptance rate.

EAGLE

Purpose-built draft model trained to consume hidden states from the target model and generate up to 8 draft tokens with high acceptance rate. The go-to algorithm for general use.

N-gram Speculation

Constructs an n-gram dictionary during prefill. Sequences can exceed 10 tokens. Mostly used for code completion where syntax is predictable and output closely matches input.
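The lookup itself is simple: index the prompt's n-grams during prefill, then propose whatever followed the most recent matching context. A minimal sketch (function names are illustrative, not from any particular library):

```python
def build_ngram_index(tokens, n=3):
    """Map each n-token context seen in the prompt to the position after it."""
    index = {}
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])] = i + n
    return index

def propose_draft(tokens, index, n=3, max_draft=10):
    """Copy up to `max_draft` tokens from where this context appeared before."""
    start = index.get(tuple(tokens[-n:]))
    if start is None:
        return []  # no match: fall back to ordinary decoding
    return tokens[start:start + max_draft]

prompt = "a b c d a b".split()
idx = build_ngram_index(prompt, n=2)
draft = propose_draft(prompt, idx, n=2)  # context "a b" was seen before
```

Because drafts are copied rather than generated, they cost almost nothing, which is why sequences can run long when output echoes input.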

Caching

Try it: KV Cache Sizing Calculator

Calculate KV cache memory for different models, contexts, and precisions.
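The arithmetic behind such a calculator: K and V each store one `head_dim` vector per layer, per KV head, per token. A sketch, using Llama-3-70B-like shape parameters (80 layers, 8 KV heads under GQA, head dim 128) as an assumed example:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

# 8K context at BF16 (2 bytes/value) for one request
gb = kv_cache_bytes(80, 8, 128, seq_len=8192, bytes_per_value=2) / 2**30
```

That single 8K-token request holds 2.5 GiB of KV cache, which is why cache capacity, not compute, often bounds batch size during decode.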

Prefix Caching and KV Cache Re-Use

Many requests share common prefixes — system prompts, few-shot examples, previous conversation turns. Prefix caching stores and reuses KV cache for these shared prefixes, skipping redundant prefill computation.

Prefix caching is especially effective for:

  • Multi-turn chat: Reuse KV from prior turns
  • Code completion: Reuse KV from file context
  • Agents: Reuse KV from system prompt and tools
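One common implementation (a sketch of the chained block-hash scheme used by serving engines such as vLLM; names and block size here are illustrative) hashes fixed-size token blocks so that equal hashes imply identical prefixes:

```python
import hashlib

def block_hashes(token_ids, block_size=16):
    """Hash fixed-size token blocks, chaining each hash with its prefix hash,
    so an equal hash implies an identical prefix up to that block."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[i:i + block_size]
        prev = hashlib.sha256(prev + str(block).encode("utf8")).digest()
        hashes.append(prev)
    return hashes

def cached_prefix_len(request, cache, block_size=16):
    """Number of leading tokens whose KV blocks are already cached."""
    n = 0
    for h in block_hashes(request, block_size):
        if h not in cache:
            break
        n += block_size
    return n

cache = set(block_hashes(list(range(64))))  # blocks from a prior request
request = list(range(32)) + [999] * 32      # shares the first 32 tokens
hit = cached_prefix_len(request, cache)     # prefill can skip `hit` tokens
```

Only whole matching blocks are reused; the divergent suffix is prefetched and computed normally.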

Where to Store the KV Cache

By default, KV cache lives on GPU VRAM. Options for overflow:

  • CPU memory (via Grace NVLink C2C for fast access)
  • Distributed KV stores (across replicas)
  • Disk (last resort)

Cache-Aware Routing

Route requests to replicas that already hold matching prefixes. Higher cache hit rates yield lower TTFT.
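A minimal sketch of the routing decision, assuming each replica advertises the set of KV block hashes it holds (the block hashes here are placeholder strings):

```python
def pick_replica(request_blocks, replica_caches):
    """Route to the replica holding the longest matching prefix of blocks."""
    def hit_len(cache):
        n = 0
        for h in request_blocks:
            if h not in cache:
                break
            n += 1
        return n
    return max(range(len(replica_caches)), key=lambda i: hit_len(replica_caches[i]))
```

Production routers (e.g., in NVIDIA Dynamo) also weigh current replica load, since pure prefix affinity can concentrate hot prompts on one worker.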

Model Parallelism

When a model is too large for one GPU, split it across multiple GPUs:

| Strategy                  | Mechanism                           | Best For                           |
|---------------------------|-------------------------------------|------------------------------------|
| Tensor Parallelism (TP)   | Split individual layers across GPUs | Lower latency (requires NVLink)    |
| Expert Parallelism (EP)   | Shard MoE experts across GPUs       | Higher throughput for MoE models   |
| Pipeline Parallelism (PP) | Split layers into stages            | Multi-node (but introduces bubbles)|

Key Takeaway

Use Tensor Parallelism within a node (where NVLink provides high bandwidth). For MoE models, Expert Parallelism enables each GPU to hold multiple full experts. Pipeline Parallelism is a last resort for multi-node setups.
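The tensor-parallel idea can be sketched with a column-split matmul, simulated on one machine with NumPy (the concatenation stands in for the all-gather that NVLink would carry between GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 512))      # activation entering a linear layer
W = rng.normal(size=(512, 1024))   # full weight matrix

# Column-parallel split across 4 "GPUs": each holds 1/4 of the output columns
shards = np.split(W, 4, axis=1)
partials = [x @ Ws for Ws in shards]   # each computed independently per GPU
y = np.concatenate(partials, axis=1)   # all-gather reassembles the output
```

Each shard's matmul needs no communication; the cost is the gather after every split layer, which is why TP wants NVLink-class bandwidth.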

Disaggregation

Disaggregation separates prefill and decode onto independently scaling hardware:

  • Prefill workers: Optimized for compute (high FLOPS)
  • Decode workers: Optimized for memory bandwidth

The KV cache computed during prefill must be transferred to decode workers. KV cache quantization alleviates this transfer bottleneck.
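To see the size of that bottleneck, a back-of-envelope sketch (reusing the Llama-3-70B-like shape from earlier and an assumed 400 Gbit/s interconnect, both illustrative numbers):

```python
def kv_transfer_ms(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value, link_gbps):
    """Time to ship one request's prefill KV cache to a decode worker."""
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    bytes_per_sec = link_gbps * 1e9 / 8   # Gbit/s -> bytes/s
    return kv_bytes / bytes_per_sec * 1e3  # -> milliseconds

bf16 = kv_transfer_ms(8192, 80, 8, 128, bytes_per_value=2, link_gbps=400)
fp8  = kv_transfer_ms(8192, 80, 8, 128, bytes_per_value=1, link_gbps=400)
```

An 8K-token request costs roughly 54 ms of transfer at BF16; quantizing the KV cache to FP8 halves it, which is the synergy noted above.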

Use disaggregation when:

  • Prefill and decode have very different resource needs
  • Traffic patterns vary (some requests are prefill-heavy, others decode-heavy)
  • You need independent scaling for cost efficiency

NVIDIA Dynamo provides dynamic disaggregation, automatically routing between prefill and decode workers based on real-time load.

Check Your Understanding

What performance improvement does FP16→FP8 quantization typically provide?