Free — no paywall, no login required to read

Read Inference Engineering free online

The complete guide to LLM inference — GPU hardware, quantization, serving frameworks, and production deployment — by Philip Kiely (Baseten Books). All 8 chapters are free to read online with interactive diagrams and calculators.

What’s included

8 full chapters

Prerequisites, model architecture, GPU hardware, serving software, optimization techniques, multimodal inference, and production deployment — all online, free.

Cheat-sheet PDF bundle

All 8 reference cheat sheets as a single PDF: GPU specs, formula sheets, quantization and framework comparisons, and deployment checklists. Enter your email to get instant access.

Interactive calculators

VRAM estimator, arithmetic intensity calculator, KV cache sizer, and more — run real inference math in the browser with instant results.

100+ exercises & quizzes

Test your understanding at the end of every section with instant-feedback quizzes and worked exercises covering everything from roofline analysis to speculative decoding.

About the book

Inference Engineering is a Baseten Books title by Philip Kiely. It covers the full stack required to deploy and optimize large language models in production: starting from the transformer architecture and GPU memory hierarchy, moving through attention optimizations (FlashAttention, GQA, paged attention), quantization (INT8, FP8, AWQ, GPTQ), and serving software (vLLM, SGLang, TensorRT-LLM), and ending with autoscaling, SLO management, and cost optimization.

The official book page is at baseten.co/inference-engineering. This site is the interactive companion — every chapter is readable here for free, enriched with animated diagrams, calculators, and quizzes not present in the static book.