About Inference Engineering
A free, interactive guide to running large language models in production — grounded in real benchmarks and the Baseten Books title by Philip Kiely.
Mission
Inference engineering (the discipline of deploying, optimising, and scaling LLM serving infrastructure) is one of the fastest-moving areas in machine learning. Yet the knowledge is scattered across research papers, vendor documentation, and hard-won production experience. This site exists to make that knowledge accessible in one place, free, and interactive.
The goal is not a broad survey of AI topics but a deep, practical treatment of a single question: how do you efficiently run a trained model at scale? That means covering GPU hardware, memory management, attention optimisations, quantization, serving frameworks, and the operational concerns that only surface in production.
What’s on this site
8 Chapters
The full inference stack, from GPU memory hierarchy and transformer internals through quantization, serving frameworks, and production SLOs.
Interactive Calculators
VRAM estimator, arithmetic intensity calculator, KV cache sizer — run real inference math in the browser, not just read about it.
Real Benchmarks
vLLM vs. SGLang on H100 SXM: actual throughput and latency numbers from 5,980 requests across multiple workloads. No marketing fluff.
Learning Paths
Guided tracks tailored to MLEs, platform engineers, and researchers — curated sequences through chapters and exercises.
Cheat Sheets
Eight printable reference sheets: GPU specs, formula tables, framework comparisons, and deployment checklists.
Glossary
50+ inference-engineering terms defined clearly and cross-linked to the chapters where they appear in context.
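To give a flavour of the inference math the calculators listed above deal with, here is a minimal Python sketch of a KV cache size estimate. It uses the standard back-of-the-envelope formula (keys plus values, per layer, per KV head, per token) and is not the site's actual calculator code. The function name, parameter names, and the example model shape are all hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    """Rough KV cache footprint in bytes.

    Assumes every layer caches one key and one value vector per KV head
    per token, stored at a fixed width (2 bytes for FP16/BF16).
    Ignores allocator overhead, paging granularity, and padding.
    """
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=4096, batch_size=1)
print(f"{size / 2**20:.0f} MiB")  # prints "512 MiB"
```

Because every term is multiplied together, the estimate scales linearly with batch size and sequence length, which is why long contexts and large batches tend to dominate GPU memory well before the weights do.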
Why trust this
Own benchmarks
Performance claims are backed by data from benchmarks we ran ourselves — not vendor datasheets. The benchmark methodology, hardware specs, and raw numbers are published openly.
Based on the book
Content is derived from Inference Engineering, a Baseten Books title by Philip Kiely (2026). The book reflects production inference work at Baseten, where Philip spent years optimising LLM serving for real customers.
Interactive-first
Every concept that can be demonstrated numerically has a calculator or diagram. Passive reading is supplemented by exercises with immediate feedback rather than end-of-chapter answer keys.
Kept current
LLM inference moves fast. Pages are updated when the state of the art changes — not just when the static book ships a new edition.
Attribution
This site is the interactive web companion to Inference Engineering by Philip Kiely, published by Baseten Books (2026). The official book page is at baseten.co/inference-engineering.
Benchmark data is collected by the Inference Engineering team on dedicated H100 SXM hardware. Methodology and raw results are published on the benchmarks page.