Learn Inference Engineering
The interactive guide to AI model inference — from GPU hardware and CUDA kernels to production autoscaling. With animated diagrams, calculators, and quizzes.
Not just a book
Learn by doing
Interactive Diagrams
Animated transformer blocks, GPU architecture explorers, and attention visualizers you can play with.
Hands-on Calculators
VRAM calculator, arithmetic intensity, KV cache sizing — run real inference math with instant feedback.
Progress Tracking
Track your reading progress, quiz scores, and exercise completion across all chapters.
Quizzes & Exercises
100+ questions with instant feedback. Test your understanding at the end of every section.
Choose your path
Guided learning tracks
8 Chapters
The complete inference stack
From prerequisites and model architecture through hardware, software, optimization techniques, multimodal inference, and production deployment.
Inference
The Three Layers of Inference
Introduces the three layers of inference: runtime, infrastructure, and tooling. A map of the entire book.
Prerequisites
Before You Optimize
Use case definition, latency and cost budgeting, model selection and evaluation, and fine-tuning for quality.
Models
Architecture and Bottlenecks
Technical architecture of LLMs and diffusion models — transformers, attention, MoE, and inference bottlenecks.
Hardware
GPUs and Accelerators
GPU architecture, compute and memory, NVIDIA generations (Hopper, Blackwell, Rubin), instances, and alternatives.
Software
From CUDA to Inference Engines
CUDA kernels, PyTorch, model formats, inference engines (vLLM, SGLang, TensorRT-LLM), NVIDIA Dynamo, and benchmarking.
Techniques
Optimization Deep Dives
Quantization, speculative decoding, KV cache reuse, model parallelism, and disaggregation in practice.
Modalities
Beyond Text
Vision language models, embeddings, ASR, TTS, image generation, and video generation inference.
Production
Ship It
Containerization, autoscaling, multi-cloud, deployment, observability, and client code for production inference.
Get started
Ready to master inference?
Start with Chapter 0 for a high-level map of inference engineering, or jump straight to the topic you need.