Inference Engineering

https://inferenceengineering.tech/
The definitive interactive guide to AI inference engineering: GPU hardware, inference engines, quantization, KV caching, and production autoscaling.
Language: en-us
Last updated: Mon, 15 Jun 2026 02:07:22 GMT
Contact: Philip Kiely (philip@baseten.co)
Copyright 2026, Baseten Books
Feed TTL: 1440 minutes

vLLM vs SGLang vs TensorRT-LLM
https://inferenceengineering.tech/learn/vllm-vs-sglang-vs-tensorrt-llm/
The three leading open-source inference engines, compared on performance, ease of use, and hardware support, with real benchmark data and a decision framework.
Published: Mon, 01 Jun 2026 00:00:00 GMT
Category: Software

GPU Inference: H100 vs A100 vs L4
https://inferenceengineering.tech/learn/gpu-inference/
How GPUs actually run model inference: compute vs memory bandwidth, prefill vs decode, and why the bottleneck is almost never what you think.
Published: Mon, 01 Jun 2026 00:00:00 GMT
Category: Hardware

AI Inference Hardware Guide
https://inferenceengineering.tech/learn/ai-inference-hardware/
The landscape of AI inference hardware (GPUs, TPUs, and dedicated inference chips) and how to compare them on the specs that actually matter.
Published: Mon, 01 Jun 2026 00:00:00 GMT
Category: Hardware

LLM Inference Acceleration
https://inferenceengineering.tech/learn/llm-inference-acceleration/
The complete toolkit for making LLM inference faster and cheaper (quantization, speculative decoding, KV caching, batching, and parallelism) and when each one actually helps.
Published: Mon, 01 Jun 2026 00:00:00 GMT
Category: Optimization

Preface
https://inferenceengineering.tech/chapters/preface/
The explosive growth of open models and why inference engineering is the most important skill in AI.
Category: Chapters

Chapter 0: Inference
https://inferenceengineering.tech/chapters/inference/
Introduces the three layers of inference: runtime, infrastructure, and tooling. A map of the entire book.
Category: Chapters

Chapter 1: Prerequisites
https://inferenceengineering.tech/chapters/prerequisites/
Use case definition, latency and cost budgeting, model selection and evaluation, and fine-tuning for quality.
Category: Chapters

Chapter 2: Models
https://inferenceengineering.tech/chapters/models/
Technical architecture of LLMs and diffusion models: transformers, attention, MoE, and inference bottlenecks.
Category: Chapters

Chapter 3: Hardware
https://inferenceengineering.tech/chapters/hardware/
GPU architecture, compute and memory, NVIDIA generations (Hopper, Blackwell, Rubin), instances, and alternatives.
Category: Chapters

Chapter 4: Software
https://inferenceengineering.tech/chapters/software/
CUDA kernels, PyTorch, model formats, inference engines (vLLM, SGLang, TensorRT-LLM), NVIDIA Dynamo, and benchmarking.
Category: Chapters

Chapter 5: Techniques
https://inferenceengineering.tech/chapters/techniques/
Quantization, speculative decoding, KV cache re-use, model parallelism, and disaggregation in practice.
Category: Chapters

Chapter 6: Modalities
https://inferenceengineering.tech/chapters/modalities/
Vision language models, embeddings, ASR, TTS, image generation, and video generation inference.
Category: Chapters

Chapter 7: Production
https://inferenceengineering.tech/chapters/production/
Containerization, autoscaling, multi-cloud, deployment, observability, and client code for production inference.
Category: Chapters