
Chapters

From the fundamentals of inference to production deployment. Eight chapters covering the complete inference engineering stack.

Preface · 8 min read

Preface

Why Inference Engineering Matters

The explosive growth of open models and why inference engineering is the most important skill in AI.

Chapter 0 · 10 min read

Inference

The Three Layers of Inference

Introduces the three layers of inference: runtime, infrastructure, and tooling. A map of the entire book.

The Inference Lifecycle · The Three Layers · Runtime · Infrastructure +2 more
Chapter 1 · 20 min read

Prerequisites

Before You Optimize

Use case definition, latency and cost budgeting, model selection and evaluation, and fine-tuning for quality.

Scale and Specialization · About Your App · Model Selection · Measuring Latency and Throughput
Chapter 2 · 35 min read

Models

Architecture and Bottlenecks

Technical architecture of LLMs and diffusion models — transformers, attention, MoE, and inference bottlenecks.

Neural Networks · LLM Inference Mechanics · Image Generation Inference · Calculating Inference Bottlenecks +1 more
Chapter 3 · 25 min read

Hardware

GPUs and Accelerators

GPU architecture, compute and memory, NVIDIA generations (Hopper, Blackwell, Rubin), instances, and alternatives.

GPU Architecture · GPU Architecture Generations · Instances · Other Datacenter Accelerators +1 more
Chapter 4 · 25 min read

Software

From CUDA to Inference Engines

CUDA kernels, PyTorch, model formats, inference engines (vLLM, SGLang, TensorRT-LLM), NVIDIA Dynamo, and benchmarking.

CUDA · Deep Learning Frameworks · Inference Engines · NVIDIA Dynamo +1 more
Chapter 5 · 40 min read

Techniques

Optimization Deep Dives

Quantization, speculative decoding, KV cache re-use, model parallelism, and disaggregation in practice.

Quantization · Speculative Decoding · Caching · Model Parallelism +1 more
Chapter 6 · 30 min read

Modalities

Beyond Text

Vision language models, embeddings, ASR, TTS, image generation, and video generation inference.

Vision Language Models · Embedding Models · ASR Models · TTS Models +2 more
Chapter 7 · 35 min read

Production

Ship It

Containerization, autoscaling, multi-cloud, deployment, observability, and client code for production inference.

Containerization · Autoscaling · Multi-Cloud Capacity Management · Testing and Deployment +1 more