Chapter 6

Modalities

Beyond Text

30 min read · 6 sections

Many of the inference engines and techniques used for LLMs carry over to related modalities, while image and video generation rely on iterative denoising and come with their own optimization details. For each modality, you need to rethink latency, throughput, and quality.

Vision Language Models#

Vision language models (VLMs) take images or videos plus a text prompt and generate text responses. A VLM consists of:

  • LLM: A standard large language model (the large component)
  • Vision encoder: A small model that converts raw images/videos into image tokens

As a rule of thumb, a high-resolution input image adds about 1,000 visual tokens to the input sequence. The primary challenge is handling longer input sequences and larger KV caches.
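The rule of thumb above can be turned into a back-of-envelope KV-cache estimate. The model shape below (layers, KV heads, head dimension) is a hypothetical configuration for illustration, not a specific VLM:

```python
# Back-of-envelope KV-cache cost of one high-resolution image in a VLM prompt.
# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
VISUAL_TOKENS = 1_000            # rule-of-thumb tokens per high-res image
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_VALUE = 2              # FP16

# Every token stores one key and one value vector per layer.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
image_kv_bytes = VISUAL_TOKENS * bytes_per_token

print(f"per token: {bytes_per_token // 1024} KiB")     # 128 KiB
print(f"per image: {image_kv_bytes / 2**20:.0f} MiB")  # 125 MiB
```

Even one image adds on the order of a hundred megabytes of KV cache under these assumptions, which is why KV-cache quantization matters so much for VLMs.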

All Chapter 5 techniques apply: quantization (especially KV cache), speculation (EAGLE), prefix caching, tensor parallelism, and disaggregation. VLMs also introduce a quality-speed tradeoff in downsampling — converting images to visual tokens at various resolutions.

Video Processing for VLMs

One second of video at 24fps = 24 frames. Each frame ≈ 1,000 tokens. A 4-second clip → ~100,000 tokens. Downsampling is practically obligatory. Prefix caching and KV cache offloading become even more important.
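The arithmetic above, plus an illustrative downsampling policy (the 2 fps sampling rate and per-frame token counts are assumptions, not a specific model's):

```python
# Visual-token budget for a short clip, before and after downsampling.
FPS, SECONDS, TOKENS_PER_FRAME = 24, 4, 1_000

raw_tokens = FPS * SECONDS * TOKENS_PER_FRAME            # 96,000 (~100k)

# Illustrative mitigation: sample 2 fps and shrink frames to 250 tokens each.
sampled_fps, small_frame_tokens = 2, 250
downsampled_tokens = sampled_fps * SECONDS * small_frame_tokens  # 2,000
```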

Omni-Modal Models

The trend is toward "omni" models that accept multiple input types and produce multiple output types. In the meantime, running production inference often means coordinating a pipeline of models, each of which must be individually optimized and should scale independently.

Embedding Models#

Embedding models transform variable-length input into fixed-length vectors that capture semantic meaning. Two traffic profiles:

  1. High-throughput backfills: Bulk indexing of millions of documents
  2. Low-latency lookups: Individual user-facing queries

Two architectures: BERT-style (encoder-only, under 1B params) and LLM-based (8B params or fewer, substantially better quality). Modern models use Matryoshka representations for dynamic dimensionality tradeoffs.
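A minimal sketch of how a Matryoshka representation is consumed at query time: truncate to the leading dimensions, then re-normalize. The toy vector is illustrative:

```python
import math

def matryoshka_truncate(embedding, dim):
    """Keep the first `dim` components and re-normalize to unit length.
    Matryoshka-trained models concentrate information in the leading
    dimensions, so the truncated vector remains usable for retrieval."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]           # toy 4-d unit embedding
short = matryoshka_truncate(full, 2)  # 2-d, still unit length
```

Smaller dimensions cut index size and lookup latency at some quality cost, which is exactly the dynamic tradeoff Matryoshka representations enable.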

Key Takeaway

For embedding models, scale horizontally (each GPU as its own replica) rather than using multi-GPU parallelism. Batching is critical — embedding models support much higher batch sizes than LLMs.
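One way to act on this takeaway is to give each traffic profile its own batching policy per replica. The knob names below are hypothetical stand-ins, though most serving frameworks expose equivalents:

```python
# Per-replica batching policy for the two embedding traffic profiles.
# Knob names are hypothetical, not a specific server's settings.
def batch_policy(profile):
    if profile == "backfill":   # bulk indexing: maximize throughput
        return {"max_batch_size": 512, "max_wait_ms": 50}
    if profile == "lookup":     # user-facing query: minimize latency
        return {"max_batch_size": 8, "max_wait_ms": 2}
    raise ValueError(f"unknown profile: {profile}")
```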

ASR Models#

Automatic Speech Recognition (ASR) models take audio in and produce text out. The most popular open model is Whisper (1.55B parameters), an encoder-decoder model in which the decoder dominates inference time.

Single-Chunk Optimization

For real-time transcription, target 200ms round-trip time. The biggest upgrade is streaming over WebSockets — continuous audio in, continuous text out.
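A 200ms target only works if every hop fits inside it. The split below is an illustrative budget, not measured data:

```python
# Illustrative round-trip budget for real-time transcription (milliseconds).
budget_ms = {
    "uplink audio": 40,
    "queueing": 20,
    "encoder": 40,
    "decoder": 80,
    "downlink text": 20,
}
total_ms = sum(budget_ms.values())   # must stay within the 200 ms target
```

Streaming over WebSockets attacks the two network entries: audio flows continuously instead of being re-uploaded per request, and partial text returns as soon as it is decoded.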

Long File Optimization

Measure with Real-Time Factor (RTF). Optimized Whisper can transcribe an hour of audio in under 4 seconds (RTF 1000x). Pipeline: VAD → parallel chunk processing → stitching.
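RTF is simply audio duration divided by wall-clock time:

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """RTF > 1 means transcription runs faster than real time."""
    return audio_seconds / wall_seconds

# One hour of audio transcribed in 3.6 s of wall-clock time.
rtf = real_time_factor(3600, 3.6)    # 1000x
```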

Diarization

Speaker identification uses a pipeline of classic ML models (segmentation, embedding, clustering), not transformers. Even optimized, diarization takes 2x+ longer than transcription.

TTS Models#

Text-to-speech models are fine-tuned LLMs (e.g., Orpheus TTS from Llama 3.2 3B). Same runtime optimizations apply: TensorRT-LLM, FP8 quantization, MIGs.

Key metrics: TTFB (time to first byte), time to first sentence, TPS (80-100 tok/s needed for real-time audio). Beyond real-time speed, scale throughput via concurrent streams.
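Why 80-100 tok/s suffices: the decode rate just has to outpace the rate at which the audio codec consumes tokens during playback. The codec rate below is an illustrative assumption, not a specific codec's:

```python
# Real-time check for a TTS stream (illustrative numbers).
CODEC_TOKENS_PER_AUDIO_SECOND = 85   # hypothetical audio-codec token rate
generation_tps = 100                 # model decode speed

stream_rtf = generation_tps / CODEC_TOKENS_PER_AUDIO_SECOND
is_realtime = stream_rtf >= 1.0      # tokens arrive faster than they play
```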

Streaming over WebSockets is again the biggest infrastructure unlock.

Speech-to-Speech Models

Speech-to-speech models unify ASR + LLM + TTS into a single model. As of publication, no commercially viable open options exist, but research is active; most production voice systems still use the cascading multi-model pipeline.

Image Generation Models#

Working with image/video generation is entirely different from LLMs:

  • Architecture: Iterative denoisers, not autoregressive generators
  • Tooling: PyTorch, TensorRT, and Diffusers (SGLang Diffusion and vLLM Omni are brand new)
  • Optimization: CUDA kernels and custom pipelines rather than inference engine configs

Kernel Optimization

Image generation benefits from:

  • Fused attention kernels (SageAttention for quality-preserving FP8 attention)
  • Quantization (FP8 for denoiser weights and activations)
  • Compilation (torch.compile for the denoising loop)

One Weird Trick

Run the text encoder and VAE concurrently with the denoiser on separate hardware. The text encoder runs once and the VAE runs once, while the denoiser runs 30-50 steps — overlapping these saves meaningful time.
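A toy sketch of the overlap, assuming a steady stream of requests: encode the next request's prompt while the denoiser works on the current one. `encode_prompt` and `denoise` are stand-ins for models that would run on separate hardware:

```python
# Pipelining the once-per-request text encoder against the 30-50 step denoiser.
from concurrent.futures import ThreadPoolExecutor

def encode_prompt(prompt):                 # runs once per request
    return f"emb({prompt})"

def denoise(latent, cond, steps=30):       # runs 30-50 iterations
    for _ in range(steps):
        latent += 1                        # stand-in for a denoising step
    return latent

with ThreadPoolExecutor(max_workers=1) as pool:
    future_cond = pool.submit(encode_prompt, "a red fox")  # next request
    latent = denoise(0, cond="emb(previous prompt)")       # current request
    next_cond = future_cond.result()       # ready by the time we need it
```

The same pattern applies to the VAE decode at the end of the pipeline: it can overlap with the first denoising steps of the following request.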

Video Generation Models#

Video generation extends image generation to 3D latent space (X, Y, Time). Models are 3-5x larger with 10-100x more latent space data.

Key optimizations:

  • Attention optimization: SageAttention and TeaCache for skipping redundant denoising steps
  • Quantization: FP8 for the denoiser
  • Context Parallelism: Replicates weights across GPUs and partitions the attention context — essential for video's massive latent space
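A toy illustration of the partitioning step. Real context parallelism also exchanges keys and values between ranks (e.g., ring-attention style communication), which is omitted here:

```python
# Context parallelism: weights are replicated, the huge latent sequence sharded.
def shard_context(tokens, world_size):
    """Split a token sequence into contiguous, even shards, one per rank."""
    per_rank = len(tokens) // world_size
    return [tokens[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

latents = list(range(8))            # stand-in for video latent tokens
shards = shard_context(latents, 4)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```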

Check Your Understanding


How many visual tokens does a high-resolution image add to a VLM input?