Chapter 0

Inference

The Three Layers of Inference

10 min read · 6 sections

Inference is the second phase in a generative AI model's lifecycle. While training is the process of learning model weights from data, inference is the process of running a trained model in production to serve requests.

The Inference Lifecycle#

During the last decade's machine learning (ML) boom, hundreds of thousands of data scientists and ML engineers became familiar with the full lifecycle — both training and inference — of ML models.

Inference for classic ML models is relatively straightforward. In the early days of Baseten, inference meant running models built with tools like XGBoost on lightweight CPU instances with a simple software stack.

In contrast, inference for generative AI models is complex. You can't simply take model weights, get some GPUs, and expect inference to be fast and reliable enough for large-scale production use.

The Three Layers#

Doing inference well requires three layers:

  • Runtime: Optimizing the performance of a single model on a single GPU-backed instance
  • Infrastructure: Scaling across clusters, regions, and clouds without creating silos while maintaining excellent uptime
  • Tooling: Providing engineers working on inference with the right level of abstraction to balance control with productivity

These three layers must work together to create a system that can handle mission-critical inference at scale.

Key Takeaway

A complete inference platform requires all three layers — runtime, infrastructure, and tooling — working in concert. Optimizing just one layer isn't enough for production-grade inference.

Runtime#

The runtime layer is responsible for making an individual model, running on a single GPU (or across several GPUs in one instance), as fast and efficient as possible.

This layer depends on a sophisticated software stack, from CUDA to PyTorch to inference engines like vLLM, SGLang, and TensorRT-LLM. Low-level optimization is important, with kernels like FlashAttention delivering significant performance gains.

The Software Stack

The software stack for inference includes multiple layers of abstraction:

  1. CUDA — Low-level GPU programming framework
  2. PyTorch — Deep learning framework for model execution
  3. Inference Engines — vLLM, SGLang, TensorRT-LLM for optimized serving
  4. FlashAttention — Optimized attention kernels
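At the bottom of this stack, the core computation being optimized is scaled dot-product attention. As a rough illustration (plain Python, with no claim to efficiency), here is the math a naive implementation performs; fused kernels like FlashAttention compute the same result without ever materializing the full score matrix in GPU memory:

```python
import math

def attention(q, ks, vs):
    """Scaled dot-product attention for a single query vector.

    q: query vector; ks/vs: lists of key/value vectors, one per token.
    A naive version like this materializes every score and weight;
    fused kernels tile and stream this computation instead.
    """
    d = len(q)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    # Softmax over scores (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(vs[0]))]
```

With one key/value pair the softmax weight is 1.0 and the output is just that value vector; with many tokens, the per-token cost of this loop is exactly what kernel-level optimization targets.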

Model Performance Techniques

The runtime layer relies on a number of model performance techniques that apply new research to the unique challenges of inference:

  • Batching: Run incoming requests in parallel, weaving them together on a token-by-token basis to increase throughput
  • Caching: Re-use the KV cache — the cached results of the attention algorithm — between requests that share prefixes
  • Quantization: Lower the precision of select pieces of the model to access more compute and reduce memory burden
  • Speculation: Generate and validate draft tokens to produce more than one token per forward pass during decode
  • Parallelism: Efficiently leverage more than one GPU to accelerate large models without introducing new bottlenecks
  • Disaggregation: Separate the two phases of LLM inference, prefill and decode, onto independently scaling workers
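To make one of these techniques concrete, here is a toy sketch of the speculation loop (draft, then verify). The deterministic stand-in functions replace real draft and target models, and the names and structure are illustrative, not any particular engine's API:

```python
def speculative_decode(target_next, draft_next, prompt, num_draft=4, max_new=8):
    """Toy sketch of speculative decoding.

    target_next / draft_next are stand-ins for a large target model and
    a cheap draft model: each maps a token sequence to the next token.
    The draft proposes `num_draft` tokens; the target keeps the longest
    matching prefix (plus its own token at the first mismatch), so
    several tokens can be accepted per target "pass".
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft model cheaply proposes a short run of tokens.
        ctx = list(tokens)
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2. Target verifies each position; a real engine checks all
        #    positions in a single batched forward pass.
        ctx = list(tokens)
        for proposed in draft:
            expected = target_next(ctx)
            ctx.append(expected)
            if proposed != expected:
                break  # first mismatch: keep target's token, discard the rest
        tokens = ctx
    return tokens[len(prompt):len(prompt) + max_new]
```

Because every accepted token matches what the target model would have produced, speculation (with greedy verification, as here) changes latency, not output.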
💡Beyond LLMs

These model performance techniques are used for models of all modalities, not just LLMs. Vision language models, embedding models, ASR, speech synthesis, image generation, and video generation all require their own inference optimizations.
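As another concrete example, quantization at its simplest maps each weight to a low-precision integer plus a shared scale factor. The sketch below shows symmetric int8 quantization of a flat weight list; production stacks quantize per-channel or per-block and rely on hardware int8/FP8 kernels, so treat this only as the core idea:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization of a flat list of float weights.

    Each weight becomes an integer in [-127, 127] sharing one float
    scale, roughly halving (vs. fp16) or quartering (vs. fp32) the
    memory the weights occupy.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [qi * scale for qi in q]
```

The reconstruction error per weight is bounded by the scale, which is why lower precision trades a small accuracy loss for more effective compute and memory bandwidth.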

Infrastructure#

These runtime optimizations are not enough. No matter how performant a single instance of a model server is, it will eventually receive more traffic than it can handle.

This is not a CUDA problem or a PyTorch problem. This is a systems problem that needs to be solved at the infrastructure layer.

The nature of infrastructure problems changes with each level of scale:

  1. Autoscaling — Knowing when to add and remove replicas and doing so quickly
  2. Capacity Management — Past a few hundred GPUs, spreading workloads across multiple regions and cloud providers
  3. Global Unification — Treating all available resources as a single unified pool of compute, eliminating silos
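For intuition at the first level, a minimal autoscaling policy sizes the fleet from in-flight concurrency. This sketch is purely illustrative (real autoscalers add smoothing, separate scale-up and scale-down windows, and cold-start-aware pre-provisioning):

```python
import math

def desired_replicas(in_flight, target_per_replica, min_replicas=1, max_replicas=64):
    """Toy autoscaling policy: size the fleet so each replica handles
    about `target_per_replica` concurrent requests, clamped to bounds."""
    if in_flight <= 0:
        return min_replicas  # idle: fall back to the configured floor
    wanted = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

Even this toy version shows why speed matters: the moment `in_flight` jumps, the gap between current and desired replicas is served badly until new replicas finish cold-starting.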

Thoughtful multi-cloud infrastructure also improves reliability, protecting against downtime in any individual region or cloud provider. For global applications, running inference near end users improves end-to-end latency.

Tooling & Developer Experience#

Once runtime and infrastructure capabilities are built, they need to be presented at the appropriate level of abstraction. Both inference providers like Baseten and internal teams building inference need to consider what tooling and developer experience to provide.

Developer experience for inference exists on a spectrum:

  • Black box — Hand a platform model weights and get back an API
  • Raw primitives — Work directly with basic constructs for compute, networking, and disk

Key Takeaway

The right developer experience is somewhere in the middle, where inference engineers have enough control to run mission-critical inference confidently, but enough abstraction to work productively.

Book Roadmap#

Inference Engineering presents a map of the technologies and techniques that power inference across all three layers of runtime, infrastructure, and tooling:

| Chapter | Topic | Focus |
| --- | --- | --- |
| Ch 1 | Prerequisites | Use case definition, latency budgets, model selection |
| Ch 2 | Models | LLM & diffusion architecture, attention, bottlenecks |
| Ch 3 | Hardware | GPU architecture, NVIDIA generations, accelerators |
| Ch 4 | Software | CUDA, PyTorch, inference engines, benchmarking |
| Ch 5 | Techniques | Quantization, speculation, caching, parallelism |
| Ch 6 | Modalities | VLMs, ASR, TTS, image & video generation |
| Ch 7 | Production | Autoscaling, multi-cloud, deployment, observability |

Check Your Understanding


What are the three layers of a complete inference platform?