Site mission

About Inference Engineering

A free, interactive guide to running large language models in production — grounded in real benchmarks and the Baseten Books title by Philip Kiely.

Mission

Inference engineering (the discipline of deploying, optimising, and scaling LLM serving infrastructure) is one of the fastest-moving areas in machine learning. Yet the knowledge is scattered across research papers, vendor documentation, and hard-won production experience. This site exists to make that knowledge accessible in one place, free, and interactive.

The goal is not a broad survey of AI topics but a deep, practical treatment of a single question: how do you efficiently run a trained model at scale? That means covering GPU hardware, memory management, attention optimisations, quantization, serving frameworks, and the operational concerns that only surface in production.

What’s on this site

8 Chapters

The full inference stack, from GPU memory hierarchy and transformer internals through quantization, serving frameworks, and production SLOs.

Interactive Calculators

VRAM estimator, arithmetic intensity calculator, and KV cache sizer: run real inference math in the browser instead of only reading about it.

Real Benchmarks

vLLM vs. SGLang on H100 SXM — actual throughput and latency numbers from 5,980 requests across workloads. No marketing fluff.

Learning Paths

Guided tracks tailored to ML engineers, platform engineers, and researchers: curated sequences through chapters and exercises.

Cheat Sheets

Eight printable reference sheets: GPU specs, formula tables, framework comparisons, and deployment checklists.

Glossary

50+ inference-engineering terms defined clearly and cross-linked to the chapters where they appear in context.

Why trust this

Own benchmarks

Performance claims are backed by data from benchmarks we ran ourselves — not vendor datasheets. The benchmark methodology, hardware specs, and raw numbers are published openly.

Based on the book

Content is derived from Inference Engineering, a Baseten Books title by Philip Kiely (2026). The book reflects production inference work at Baseten, where Philip spent years optimising LLM serving for real customers.

Interactive-first

Every concept that can be demonstrated numerically has a calculator or diagram. Passive reading is supplemented by exercises with immediate feedback rather than end-of-chapter answer keys.

Kept current

LLM inference moves fast. Pages are updated when the state of the art changes — not just when the static book ships a new edition.

Attribution

This site is the interactive web companion to Inference Engineering by Philip Kiely, published by Baseten Books (2026). The official book page is at baseten.co/inference-engineering.

Benchmark data is collected by the Inference Engineering team on dedicated H100 SXM hardware. Methodology and raw results are published on the benchmarks page.
