Chapters
From the fundamentals of inference to production deployment. Eight chapters covering the complete inference engineering stack.
Preface
Why Inference Engineering Matters
The explosive growth of open models and why inference engineering is the most important skill in AI.
Inference
The Three Layers of Inference
The three layers of inference: runtime, infrastructure, and tooling. A map of the entire book.
Prerequisites
Before You Optimize
Use case definition, latency and cost budgeting, model selection and evaluation, and fine-tuning for quality.
Models
Architecture and Bottlenecks
Technical architecture of LLMs and diffusion models: transformers, attention, MoE, and inference bottlenecks.
Hardware
GPUs and Accelerators
GPU architecture, compute and memory, NVIDIA generations (Hopper, Blackwell, Rubin), cloud GPU instances, and alternatives.
Software
From CUDA to Inference Engines
CUDA kernels, PyTorch, model formats, inference engines (vLLM, SGLang, TensorRT-LLM), NVIDIA Dynamo, and benchmarking.
Techniques
Optimization Deep Dives
Quantization, speculative decoding, KV cache reuse, model parallelism, and disaggregation in practice.
Modalities
Beyond Text
Vision-language models, embeddings, ASR, TTS, and image and video generation inference.
Production
Ship It
Containerization, autoscaling, multi-cloud, deployment, observability, and client code for production inference.