Chapter 0

Inference

The Three Layers of Inference

10 min read · 6 sections

Inference is the second phase in a generative AI model's lifecycle. While training is the process of learning model weights from data, inference is the process of running a trained model in production to serve requests.

The Inference Lifecycle#

During the last decade's machine learning (ML) boom, hundreds of thousands of data scientists and ML engineers became familiar with the full lifecycle — both training and inference — of ML models.

Inference for classic ML models is relatively straightforward. In the early days of Baseten, inference meant running models built with tools like XGBoost on lightweight CPU instances with a simple software stack.

In contrast, inference for generative AI models is complex. You can't simply take model weights, get some GPUs, and expect inference to be fast and reliable enough for large-scale production use.

The Three Layers#

Doing inference well requires three layers:

  • Runtime: Optimizing the performance of a single model on a single GPU-backed instance
  • Infrastructure: Scaling across clusters, regions, and clouds without creating silos while maintaining excellent uptime
  • Tooling: Providing engineers working on inference with the right level of abstraction to balance control with productivity

These three layers must work together to create a system that can handle mission-critical inference at scale.

Key Takeaway

A complete inference platform requires all three layers — runtime, infrastructure, and tooling — working in concert. Optimizing just one layer isn't enough for production-grade inference.

Runtime#

The runtime layer is responsible for making an individual model, running on a single GPU (or across several GPUs in one instance), as fast and efficient as possible.

This layer depends on a sophisticated software stack, from CUDA to PyTorch to inference engines like vLLM, SGLang, and TensorRT-LLM. Low-level optimization is important, with kernels like FlashAttention delivering significant performance gains.

The Software Stack

The software stack for inference includes multiple layers of abstraction:

  1. CUDA — Low-level GPU programming framework
  2. PyTorch — Deep learning framework for model execution
  3. Inference Engines — vLLM, SGLang, TensorRT-LLM for optimized serving
  4. FlashAttention — Optimized attention kernels
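At the bottom of this stack, the core computation being optimized is scaled dot-product attention. As a rough illustration (plain Python, with no claim to efficiency), here is the math a naive implementation performs; fused kernels like FlashAttention compute the same result without ever materializing the full score matrix in GPU memory:

```python
import math

def attention(q, ks, vs):
    """Scaled dot-product attention for a single query vector.

    q: query vector; ks/vs: lists of key/value vectors, one per token.
    A naive version like this materializes every score and weight;
    fused kernels tile and stream this computation instead.
    """
    d = len(q)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    # Softmax over scores (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(vs[0]))]
```

With one key/value pair the softmax weight is 1.0 and the output is just that value vector; with many tokens, the per-token cost of this loop is exactly what kernel-level optimization targets.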

Model Performance Techniques

The runtime layer relies on a number of model performance techniques that apply new research to the unique challenges of inference:

  • Batching: Run incoming requests in parallel, weaving them together on a token-by-token basis to increase throughput
  • Caching: Re-use the KV cache — the cached results of the attention algorithm — between requests that share prefixes
  • Quantization: Lower the precision of select pieces of the model to access more compute and reduce memory burden
  • Speculation: Generate and validate draft tokens to produce more than one token per forward pass during decode
  • Parallelism: Efficiently leverage more than one GPU to accelerate large models without introducing new bottlenecks
  • Disaggregation: Separate the two phases of LLM inference, prefill and decode, onto independently scaling workers
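To make one of these techniques concrete, here is a toy sketch of the speculation loop (draft, then verify). The deterministic stand-in functions replace real draft and target models, and the names and structure are illustrative, not any particular engine's API:

```python
def speculative_decode(target_next, draft_next, prompt, num_draft=4, max_new=8):
    """Toy sketch of speculative decoding.

    target_next / draft_next are stand-ins for a large target model and
    a cheap draft model: each maps a token sequence to the next token.
    The draft proposes `num_draft` tokens; the target keeps the longest
    matching prefix (plus its own token at the first mismatch), so
    several tokens can be accepted per target "pass".
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft model cheaply proposes a short run of tokens.
        ctx = list(tokens)
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2. Target verifies each position; a real engine checks all
        #    positions in a single batched forward pass.
        ctx = list(tokens)
        for proposed in draft:
            expected = target_next(ctx)
            ctx.append(expected)
            if proposed != expected:
                break  # first mismatch: keep target's token, discard the rest
        tokens = ctx
    return tokens[len(prompt):len(prompt) + max_new]
```

Because every accepted token matches what the target model would have produced, speculation (with greedy verification, as here) changes latency, not output.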
💡Beyond LLMs

These model performance techniques are used for models of all modalities, not just LLMs. Vision language models, embedding models, ASR, speech synthesis, image generation, and video generation all require their own inference optimizations.
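As another concrete example, quantization at its simplest maps each weight to a low-precision integer plus a shared scale factor. The sketch below shows symmetric int8 quantization of a flat weight list; production stacks quantize per-channel or per-block and rely on hardware int8/FP8 kernels, so treat this only as the core idea:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization of a flat list of float weights.

    Each weight becomes an integer in [-127, 127] sharing one float
    scale, roughly halving (vs. fp16) or quartering (vs. fp32) the
    memory the weights occupy.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [qi * scale for qi in q]
```

The reconstruction error per weight is bounded by the scale, which is why lower precision trades a small accuracy loss for more effective compute and memory bandwidth.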

Infrastructure#

These runtime optimizations are not enough. No matter how performant a single instance of a model server is, it will eventually receive more traffic than it can handle.

This is not a CUDA problem or a PyTorch problem. This is a systems problem that needs to be solved at the infrastructure layer.

The nature of infrastructure problems changes with each level of scale:

  1. Autoscaling — Knowing when to add and remove replicas and doing so quickly
  2. Capacity Management — Past a few hundred GPUs, spreading workloads across multiple regions and cloud providers
  3. Global Unification — Treating all available resources as a single unified pool of compute, eliminating silos
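For intuition at the first level, a minimal autoscaling policy sizes the fleet from in-flight concurrency. This sketch is purely illustrative (real autoscalers add smoothing, separate scale-up and scale-down windows, and cold-start-aware pre-provisioning):

```python
import math

def desired_replicas(in_flight, target_per_replica, min_replicas=1, max_replicas=64):
    """Toy autoscaling policy: size the fleet so each replica handles
    about `target_per_replica` concurrent requests, clamped to bounds."""
    if in_flight <= 0:
        return min_replicas  # idle: fall back to the configured floor
    wanted = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

Even this toy version shows why speed matters: the moment `in_flight` jumps, the gap between current and desired replicas is served badly until new replicas finish cold-starting.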

Thoughtful multi-cloud infrastructure also improves reliability, protecting against downtime in any individual region or cloud provider. For global applications, running inference near end users improves end-to-end latency.

Tooling & Developer Experience#

Once runtime and infrastructure capabilities are built, they need to be presented at the appropriate level of abstraction. Both inference providers like Baseten and internal teams building inference need to consider what tooling and developer experience to provide.

Developer experience for inference exists on a spectrum:

  • Black box — Hand a platform model weights and get back an API
  • Raw primitives — Work directly with basic constructs for compute, networking, and disk

Key Takeaway

The right developer experience is somewhere in the middle, where inference engineers have enough control to run mission-critical inference confidently, but enough abstraction to work productively.

Book Roadmap#

Inference Engineering presents a map of the technologies and techniques that power inference across all three layers of runtime, infrastructure, and tooling:

| Chapter | Topic | Focus |
| --- | --- | --- |
| Ch 1 | Prerequisites | Use case definition, latency budgets, model selection |
| Ch 2 | Models | LLM & diffusion architecture, attention, bottlenecks |
| Ch 3 | Hardware | GPU architecture, NVIDIA generations, accelerators |
| Ch 4 | Software | CUDA, PyTorch, inference engines, benchmarking |
| Ch 5 | Techniques | Quantization, speculation, caching, parallelism |
| Ch 6 | Modalities | VLMs, ASR, TTS, image & video generation |
| Ch 7 | Production | Autoscaling, multi-cloud, deployment, observability |

Check Your Understanding


What are the three layers of a complete inference platform?