<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Inference Engineering</title>
    <link>https://inferenceengineering.tech/</link>
    <atom:link href="https://inferenceengineering.tech/feed.xml" rel="self" type="application/rss+xml" />
    <description>The definitive interactive guide to AI inference engineering — GPU hardware, inference engines, quantization, KV caching, and production autoscaling.</description>
    <language>en-us</language>
    <lastBuildDate>Mon, 15 Jun 2026 02:07:22 GMT</lastBuildDate>
    <managingEditor>philip@baseten.co (Philip Kiely)</managingEditor>
    <webMaster>philip@baseten.co (Philip Kiely)</webMaster>
    <copyright>Copyright 2026, Baseten Books</copyright>
    <ttl>1440</ttl>

  <item>
    <title>vLLM vs SGLang vs TensorRT-LLM</title>
    <link>https://inferenceengineering.tech/learn/vllm-vs-sglang-vs-tensorrt-llm/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/learn/vllm-vs-sglang-vs-tensorrt-llm/</guid>
    <description>The three leading open-source inference engines, compared on performance, ease of use, and hardware support — with real benchmark data and a decision framework.</description>
    <pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate>
    <category>Software</category>
  </item>

  <item>
    <title>GPU Inference: H100 vs A100 vs L4</title>
    <link>https://inferenceengineering.tech/learn/gpu-inference/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/learn/gpu-inference/</guid>
    <description>How GPUs actually run model inference — compute vs memory bandwidth, prefill vs decode, and why the bottleneck is almost never what you think.</description>
    <pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate>
    <category>Hardware</category>
  </item>

  <item>
    <title>AI Inference Hardware Guide</title>
    <link>https://inferenceengineering.tech/learn/ai-inference-hardware/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/learn/ai-inference-hardware/</guid>
    <description>The landscape of AI inference hardware — GPUs, TPUs, and dedicated inference chips — and how to compare them on the specs that actually matter.</description>
    <pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate>
    <category>Hardware</category>
  </item>

  <item>
    <title>LLM Inference Acceleration</title>
    <link>https://inferenceengineering.tech/learn/llm-inference-acceleration/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/learn/llm-inference-acceleration/</guid>
    <description>The complete toolkit for making LLM inference faster and cheaper — quantization, speculative decoding, KV caching, batching, and parallelism — and when each one actually helps.</description>
    <pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate>
    <category>Optimization</category>
  </item>

  <item>
    <title>Preface</title>
    <link>https://inferenceengineering.tech/chapters/preface/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/preface/</guid>
    <description>The explosive growth of open models and why inference engineering is the most important skill in AI.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 0: Inference</title>
    <link>https://inferenceengineering.tech/chapters/inference/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/inference/</guid>
    <description>Introduces the three layers of inference (runtime, infrastructure, and tooling) and provides a map of the entire book.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 1: Prerequisites</title>
    <link>https://inferenceengineering.tech/chapters/prerequisites/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/prerequisites/</guid>
    <description>Use case definition, latency and cost budgeting, model selection and evaluation, and fine-tuning for quality.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 2: Models</title>
    <link>https://inferenceengineering.tech/chapters/models/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/models/</guid>
    <description>Technical architecture of LLMs and diffusion models — transformers, attention, MoE, and inference bottlenecks.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 3: Hardware</title>
    <link>https://inferenceengineering.tech/chapters/hardware/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/hardware/</guid>
    <description>GPU architecture, compute and memory, NVIDIA generations (Hopper, Blackwell, Rubin), instances, and alternatives.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 4: Software</title>
    <link>https://inferenceengineering.tech/chapters/software/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/software/</guid>
    <description>CUDA kernels, PyTorch, model formats, inference engines (vLLM, SGLang, TensorRT-LLM), NVIDIA Dynamo, and benchmarking.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 5: Techniques</title>
    <link>https://inferenceengineering.tech/chapters/techniques/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/techniques/</guid>
    <description>Quantization, speculative decoding, KV cache reuse, model parallelism, and disaggregation in practice.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 6: Modalities</title>
    <link>https://inferenceengineering.tech/chapters/modalities/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/modalities/</guid>
    <description>Vision language models, embeddings, ASR, TTS, image generation, and video generation inference.</description>
    <category>Chapters</category>
  </item>

  <item>
    <title>Chapter 7: Production</title>
    <link>https://inferenceengineering.tech/chapters/production/</link>
    <guid isPermaLink="true">https://inferenceengineering.tech/chapters/production/</guid>
    <description>Containerization, autoscaling, multi-cloud, deployment, observability, and client code for production inference.</description>
    <category>Chapters</category>
  </item>
  </channel>
</rss>