An inference engineer's goal is to create a balanced set of optimizations that delivers more than the sum of its parts. Sometimes techniques are symbiotic, sometimes incompatible — quantizing the KV cache helps disaggregation, but increasing batch size reduces compute available for speculation.
Quantization#
Quantization improves latency (both TTFT and TPS), increases throughput, and opens headroom for other optimizations. But when it goes wrong, it can materially reduce output quality.
Post-training quantization changes model weights from their native format (usually BF16 or FP16) to a lower-precision format:
- Prefill: Compute-bound prefill now runs on lower-precision Tensor Cores with 2x FLOPS
- Decode: Memory-bound decode now loads half as much data, effectively doubling bandwidth
In practice, dropping a single level of precision (e.g., FP16 → FP8) yields 30-50% better performance for LLMs.
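The decode-side claim is easy to sanity-check with back-of-envelope arithmetic: when decode is bound by weight streaming, halving bytes per parameter halves the time floor per token. A minimal sketch (the 70B parameter count and ~3,350 GB/s H100-class bandwidth are illustrative assumptions):

```python
def decode_step_ms(param_count: float, bytes_per_param: float,
                   bandwidth_gbps: float) -> float:
    """Time to stream all weights once — the floor for one memory-bound decode step."""
    bytes_total = param_count * bytes_per_param
    return bytes_total / (bandwidth_gbps * 1e9) * 1e3

params = 70e9      # 70B-parameter dense model (assumed)
hbm_bw = 3350      # GB/s, roughly H100 SXM HBM3 (assumed)

bf16 = decode_step_ms(params, 2.0, hbm_bw)   # 2 bytes/param
fp8 = decode_step_ms(params, 1.0, hbm_bw)    # 1 byte/param
print(f"BF16: {bf16:.1f} ms/token, FP8: {fp8:.1f} ms/token "
      f"({bf16 / fp8:.1f}x faster)")
```

Real speedups land below this 2x ceiling (hence the 30-50% figure) because activations, KV cache reads, and kernel overheads don't shrink with the weights.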
Number Formats
| Name | Bits | First Architecture | Use |
|---|---|---|---|
| FP16 | 16 | Pascal (2016) | Default inference |
| BF16 | 16 | Ampere (2020) | Training & inference |
| FP8 | 8 | Hopper (2022) | Sweet spot for quality/speed |
| MXFP8 | 8 | Blackwell (2024) | Microscaling for better accuracy |
| FP4 | 4 | Blackwell (2024) | Aggressive quantization |
| NVFP4 | 4 | Blackwell (2024) | Best FP4 accuracy (block size 16) |
Floating-point formats have sign, exponent, and mantissa bits — the exponent gives higher dynamic range than integers, better representing outliers after quantization.
Granularity matters: scale factors can be assigned per tensor (one scale for everything), per channel, or per block. Finer granularity lowers the risk of quality loss but adds scale-factor storage and compute overhead.
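The granularity tradeoff is easiest to see with a single outlier weight. A sketch using simple symmetric absmax scaling toward the FP8 E4M3 maximum of 448 (the block size of 32 and the absence of outlier clipping are assumptions):

```python
import numpy as np

FP8_MAX = 448.0  # largest representable value in FP8 E4M3

def quantize_per_tensor(w: np.ndarray):
    scale = np.abs(w).max() / FP8_MAX            # one scale for the whole tensor
    return w / scale, scale

def quantize_per_block(w: np.ndarray, block: int = 32):
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / FP8_MAX  # one scale per block
    return w / scales, scales

w = np.random.randn(4096).astype(np.float32)
w[0] = 50.0                                      # a single large outlier
_, s_tensor = quantize_per_tensor(w)
_, s_blocks = quantize_per_block(w)
# Per-tensor: the outlier inflates the one global scale, crushing small
# weights toward zero. Per-block: only the outlier's 32-element block pays;
# the other 127 blocks keep their full dynamic range.
```

This is exactly why the microscaling formats (MXFP8, NVFP4 with block size 16) exist: small blocks contain the damage from outliers.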
Quantization Approaches
Components vary in sensitivity — from least to most:
- Weights (linear layers) — least sensitive
- Activations — somewhat sensitive
- KV cache — moderately sensitive (errors compound token-to-token)
- Attention (especially softmax) — highly sensitive
Key Takeaway
FP8/MXFP8 is the sweet spot for production inference. Quantize weights and activations to FP8, quantize the KV cache carefully, and leave attention in its original precision. If you can't tolerate any quality loss, skip quantization — every other technique in this chapter is lossless.
Try it: Quantization Quality Estimator →
Explore precision tradeoffs: memory savings, speedup, and quality risk.
Measuring Quality Impact
Three methods: perplexity (simplest), intelligence benchmarks (MMLU, SWE-bench), and custom evals (best). In every case, you're looking for differences indistinguishable from noise.
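Perplexity is just the exponentiated average negative log-likelihood of the model over a held-out corpus. A minimal sketch (the log-probability values are illustrative, not from a real model):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

baseline = perplexity([-2.1, -0.3, -1.7, -0.9])   # original-precision model
quantized = perplexity([-2.2, -0.3, -1.8, -0.9])  # same text, quantized model
print(baseline, quantized)
```

Run both models over the same corpus and compare: a quantized model whose perplexity delta sits within run-to-run noise passes this check, though perplexity alone can miss task-specific regressions — hence the benchmarks and custom evals.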
Speculative Decoding#
Speculative decoding exploits spare compute during memory-bound decode to generate multiple tokens per forward pass. Only improves TPS/ITL, not TTFT.
The mechanism:
- A speculator generates draft tokens
- The target model validates drafts in a single forward pass
- Accepted tokens + one new token = N+1 tokens per pass
Performance depends on: draft token cost, draft sequence length, and token acceptance rate.
Try it: Speculative Decoding Simulator →
Adjust draft length, acceptance rate, and overhead to see effective TPS.
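The arithmetic behind those three factors can be sketched under the standard simplification that each draft token is accepted independently with probability `a` (real acceptance rates are position-dependent; the function names are assumptions):

```python
def expected_tokens(a: float, k: int) -> float:
    """Expected tokens per target forward pass with draft length k and
    per-token acceptance rate a: accepted prefix + 1 bonus token
    = sum_{i=0}^{k} a^i."""
    return (1 - a ** (k + 1)) / (1 - a)

def effective_tps(base_tps: float, a: float, k: int, overhead: float) -> float:
    """base_tps: target-only decode speed; overhead: drafting cost relative
    to one target forward pass."""
    return base_tps * expected_tokens(a, k) / (1 + overhead)

print(expected_tokens(0.8, 4))   # ≈ 3.36 tokens per target pass
```

Note the diminishing returns: with acceptance rate 0.8, going from 4 to 8 draft tokens adds barely one expected token, while drafting cost keeps growing — which is why draft length is worth tuning rather than maximizing.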
Draft-Target Speculation
Uses a small draft model (typically ~10x smaller than the target, usually from the same family). Quick to set up but has the highest overhead — you must store the draft model's weights, activations, and KV cache alongside the target's.
Medusa
Fine-tunes the target model to add 2-4 extra decoder heads that generate draft tokens. No separate model needed, but limited acceptance rate.
EAGLE
Purpose-built draft model trained to consume hidden states from the target model and generate up to 8 draft tokens with high acceptance rate. The go-to algorithm for general use.
N-gram Speculation
Constructs an n-gram dictionary during prefill. Sequences can exceed 10 tokens. Mostly used for code completion where syntax is predictable and output closely matches input.
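A minimal sketch of the idea: index the prompt's n-grams during prefill, then propose as drafts whatever followed the current n-gram last time it appeared (the bigram key, function names, and draft length cap are assumptions):

```python
from collections import defaultdict

def build_index(tokens: list[int], n: int = 2) -> dict:
    """Map each n-gram in the context to the positions right after it."""
    index = defaultdict(list)
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])].append(i + n)
    return index

def propose(tokens: list[int], index: dict, n: int = 2, max_draft: int = 8):
    """Draft tokens by copying the continuation of the most recent match."""
    key = tuple(tokens[-n:])
    if key not in index:
        return []                              # no match: fall back to plain decode
    start = index[key][-1]
    return tokens[start:start + max_draft]

toks = [5, 9, 2, 7, 1, 5, 9]                   # last bigram (5, 9) also seen earlier
print(propose(toks, build_index(toks)))        # → [2, 7, 1, 5, 9]
```

Because drafting is a dictionary lookup, the overhead is near zero — the technique wins whenever the output repeats the input, as in code edits and retrieval-heavy prompts.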
Caching#
Try it: KV Cache Sizing Calculator →
Calculate KV cache memory for different models, contexts, and precisions.
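The calculator's core formula is simple: per request, the cache holds a key and a value vector for every layer, KV head, and token. A sketch with a Llama-3-70B-like shape (80 layers, 8 GQA KV heads, head dimension 128 — assumed for illustration):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int) -> float:
    """KV cache footprint for one request: 2 (K and V) x layers x KV heads
    x head_dim x tokens x element size."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# 8k-token context:
print(kv_cache_gb(80, 8, 128, 8192, 2))   # BF16: ≈ 2.68 GB per request
print(kv_cache_gb(80, 8, 128, 8192, 1))   # FP8:  ≈ 1.34 GB per request
```

At these sizes, a few dozen concurrent long-context requests exhaust an 80 GB GPU — which is why KV cache precision, eviction, and offload policies matter so much.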
Prefix Caching and KV Cache Re-Use
Many requests share common prefixes — system prompts, few-shot examples, previous conversation turns. Prefix caching stores and reuses KV cache for these shared prefixes, skipping redundant prefill computation.
Prefix caching is especially effective for:
- Multi-turn chat: Reuse KV from prior turns
- Code completion: Reuse KV from file context
- Agents: Reuse KV from system prompt and tools
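One common implementation strategy is block-level hashing: hash the token sequence in fixed-size blocks, chaining each block's hash with its predecessor so that identical prefixes always produce identical keys. A sketch (the block size of 16 and the chaining scheme are assumptions):

```python
import hashlib

BLOCK = 16  # tokens per cached KV block (assumed)

def block_keys(tokens: list[int]) -> list[str]:
    """Chained hashes: a block's key depends on its contents and its prefix."""
    keys, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        keys.append(h)
        prev = h
    return keys

def cached_prefix_len(tokens: list[int], cache: set[str]) -> int:
    """How many leading tokens already have KV blocks in the cache."""
    hit = 0
    for k in block_keys(tokens):
        if k not in cache:
            break
        hit += BLOCK
    return hit                                 # prefill only runs on tokens[hit:]

system_prompt = list(range(64))
cache = set(block_keys(system_prompt))         # populated by an earlier request
request = system_prompt + [999, 1000, 1001]    # same prefix, new user turn
print(cached_prefix_len(request, cache))       # → 64: prefill skips the shared prefix
```

The chaining is what makes partial matches safe: a block's key only matches if every block before it matched too, so reuse never mixes KV state from different prefixes.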
Where to Store the KV Cache
By default, the KV cache lives in GPU VRAM. Options for overflow:
- CPU memory (via Grace NVLink C2C for fast access)
- Distributed KV stores (across replicas)
- Disk (last resort)
Cache-Aware Routing
Route requests to replicas that already hold matching prefixes. Higher cache hit rates yield lower TTFT.
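A routing policy along these lines can be sketched in a few lines: prefer the replica with the longest cached prefix for the request, breaking ties toward the least-loaded one (replica names and the precomputed hit counts are illustrative; a real router would derive them from KV block hashes):

```python
def pick_replica(prefix_hits: dict[str, int], load: dict[str, int]) -> str:
    """Maximize cached prefix tokens; break ties by fewest in-flight requests."""
    return max(prefix_hits, key=lambda r: (prefix_hits[r], -load[r]))

hits = {"replica-a": 0, "replica-b": 4096, "replica-c": 1024}
load = {"replica-a": 2, "replica-b": 9, "replica-c": 3}
print(pick_replica(hits, load))   # → replica-b: most prefix tokens already cached
```

In practice the policy also needs a load cap, otherwise popular prefixes concentrate all traffic on one replica and queueing delay erases the TTFT win.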
Model Parallelism#
When a model is too large for one GPU, split it across multiple GPUs:
| Strategy | Mechanism | Best For |
|---|---|---|
| Tensor Parallelism (TP) | Split individual layers across GPUs | Lower latency (requires NVLink) |
| Expert Parallelism (EP) | Shard MoE experts across GPUs | Higher throughput for MoE models |
| Pipeline Parallelism (PP) | Split layers into stages | Multi-node (but introduces bubbles) |
Key Takeaway
Use Tensor Parallelism within a node (where NVLink provides high bandwidth). For MoE models, Expert Parallelism enables each GPU to hold multiple full experts. Pipeline Parallelism is a last resort for multi-node setups.
Disaggregation#
Disaggregation separates prefill and decode onto independently scaling hardware:
- Prefill workers: Optimized for compute (high FLOPS)
- Decode workers: Optimized for memory bandwidth
The KV cache computed during prefill must be transferred to decode workers. KV cache quantization alleviates this transfer bottleneck.
Use disaggregation when:
- Prefill and decode have very different resource needs
- Traffic patterns vary (some requests are prefill-heavy, others decode-heavy)
- You need independent scaling for cost efficiency
NVIDIA Dynamo provides dynamic disaggregation, automatically routing between prefill and decode workers based on real-time load.