Read Inference Engineering free online
The complete guide to LLM inference — GPU hardware, quantization, serving frameworks, and production deployment — by Philip Kiely (Baseten Books). All 8 chapters are free to read online with interactive diagrams and calculators.
What’s included
8 full chapters
Prerequisites, model architecture, GPU hardware, serving software, optimization techniques, multimodal inference, and production deployment — all online, free.
Cheat-sheet PDF bundle
All 8 reference cheat sheets as a single PDF — GPU specs, formula sheets, framework comparisons, and deployment checklists. Enter your email to get instant access.
Interactive calculators
VRAM estimator, arithmetic intensity calculator, KV cache sizer, and more — run real inference math in the browser with instant results.
100+ exercises & quizzes
Test your understanding at the end of every section with instant-feedback quizzes and worked exercises covering everything from roofline analysis to speculative decoding.
About the book
Inference Engineering is a Baseten Books title by Philip Kiely. It covers the full stack required to deploy and optimize large language models in production. The book starts with the transformer architecture and the GPU memory hierarchy, then moves through attention optimizations (FlashAttention, GQA, paged attention), quantization (INT8, FP8, AWQ, GPTQ), and serving software (vLLM, SGLang, TensorRT-LLM), and ends with autoscaling, SLO management, and cost optimization.
The official book page is at baseten.co/inference-engineering. This site is the interactive companion — every chapter is readable here for free, enriched with animated diagrams, calculators, and quizzes not present in the static book.