Chapter 6

Modalities

Beyond Text

30 min read · 6 sections

Many of the inference engines and techniques used for LLMs carry over to related modalities, while image and video generation rely on iterative denoising and come with their own optimization details. For each modality, you need to rethink latency, throughput, and quality.

Vision Language Models#

Vision language models (VLMs) take images or videos plus a text prompt and generate text responses. A VLM consists of:

  • LLM: A standard large language model (the large component)
  • Vision encoder: A small model that converts raw images/videos into image tokens

As a rule of thumb, a high-resolution input image adds about 1,000 visual tokens to the input sequence. The primary challenge is handling longer input sequences and larger KV caches.
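The rule of thumb above can be turned into a back-of-envelope KV-cache estimate. The model shape below (layers, KV heads, head dimension) is a hypothetical configuration for illustration, not a specific VLM:

```python
# Back-of-envelope KV-cache cost of one high-resolution image in a VLM prompt.
# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
VISUAL_TOKENS = 1_000            # rule-of-thumb tokens per high-res image
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_VALUE = 2              # FP16

# Every token stores one key and one value vector per layer.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
image_kv_bytes = VISUAL_TOKENS * bytes_per_token

print(f"per token: {bytes_per_token // 1024} KiB")     # 128 KiB
print(f"per image: {image_kv_bytes / 2**20:.0f} MiB")  # 125 MiB
```

Even one image adds on the order of a hundred megabytes of KV cache under these assumptions, which is why KV-cache quantization matters so much for VLMs.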

All Chapter 5 techniques apply: quantization (especially KV cache), speculation (EAGLE), prefix caching, tensor parallelism, and disaggregation. VLMs also introduce a quality-speed tradeoff in downsampling — converting images to visual tokens at various resolutions.

Video Processing for VLMs

One second of video at 24fps = 24 frames. Each frame ≈ 1,000 tokens. A 4-second clip → ~100,000 tokens. Downsampling is practically obligatory. Prefix caching and KV cache offloading become even more important.
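The arithmetic above, plus an illustrative downsampling policy (the 2 fps sampling rate and per-frame token counts are assumptions, not a specific model's):

```python
# Visual-token budget for a short clip, before and after downsampling.
FPS, SECONDS, TOKENS_PER_FRAME = 24, 4, 1_000

raw_tokens = FPS * SECONDS * TOKENS_PER_FRAME            # 96,000 (~100k)

# Illustrative mitigation: sample 2 fps and shrink frames to 250 tokens each.
sampled_fps, small_frame_tokens = 2, 250
downsampled_tokens = sampled_fps * SECONDS * small_frame_tokens  # 2,000
```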

Omni-Modal Models

The trend is toward "omni" models that accept multiple input types and produce multiple output types. In the meantime, running production inference often means coordinating a pipeline of models, each of which must be individually optimized and should scale independently.

Embedding Models#

Embedding models transform variable-length input into fixed-length vectors that capture semantic meaning. Two traffic profiles:

  1. High-throughput backfills: Bulk indexing of millions of documents
  2. Low-latency lookups: Individual user-facing queries

Two architectures: BERT-style (encoder-only, under 1B params) and LLM-based (8B params or fewer, substantially better quality). Modern models use Matryoshka representations for dynamic dimensionality tradeoffs.
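A minimal sketch of how a Matryoshka representation is consumed at query time: truncate to the leading dimensions, then re-normalize. The toy vector is illustrative:

```python
import math

def matryoshka_truncate(embedding, dim):
    """Keep the first `dim` components and re-normalize to unit length.
    Matryoshka-trained models concentrate information in the leading
    dimensions, so the truncated vector remains usable for retrieval."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]           # toy 4-d unit embedding
short = matryoshka_truncate(full, 2)  # 2-d, still unit length
```

Smaller dimensions cut index size and lookup latency at some quality cost, which is exactly the dynamic tradeoff Matryoshka representations enable.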

Key Takeaway

For embedding models, scale horizontally (each GPU as its own replica) rather than using multi-GPU parallelism. Batching is critical — embedding models support much higher batch sizes than LLMs.
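One way to act on this takeaway is to give each traffic profile its own batching policy per replica. The knob names below are hypothetical stand-ins, though most serving frameworks expose equivalents:

```python
# Per-replica batching policy for the two embedding traffic profiles.
# Knob names are hypothetical, not a specific server's settings.
def batch_policy(profile):
    if profile == "backfill":   # bulk indexing: maximize throughput
        return {"max_batch_size": 512, "max_wait_ms": 50}
    if profile == "lookup":     # user-facing query: minimize latency
        return {"max_batch_size": 8, "max_wait_ms": 2}
    raise ValueError(f"unknown profile: {profile}")
```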

ASR Models#

Automatic Speech Recognition (ASR) models take audio in and produce text out. The most popular open model is Whisper (1.55B parameters), an encoder-decoder model in which the decoder dominates inference time.

Single-Chunk Optimization

For real-time transcription, target 200ms round-trip time. The biggest upgrade is streaming over WebSockets — continuous audio in, continuous text out.
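A 200ms target only works if every hop fits inside it. The split below is an illustrative budget, not measured data:

```python
# Illustrative round-trip budget for real-time transcription (milliseconds).
budget_ms = {
    "uplink audio": 40,
    "queueing": 20,
    "encoder": 40,
    "decoder": 80,
    "downlink text": 20,
}
total_ms = sum(budget_ms.values())   # must stay within the 200 ms target
```

Streaming over WebSockets attacks the two network entries: audio flows continuously instead of being re-uploaded per request, and partial text returns as soon as it is decoded.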

Long File Optimization

Measure with Real-Time Factor (RTF). Optimized Whisper can transcribe an hour of audio in under 4 seconds (RTF 1000x). Pipeline: VAD → parallel chunk processing → stitching.
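RTF is simply audio duration divided by wall-clock time:

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """RTF > 1 means transcription runs faster than real time."""
    return audio_seconds / wall_seconds

# One hour of audio transcribed in 3.6 s of wall-clock time.
rtf = real_time_factor(3600, 3.6)    # 1000x
```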

Diarization

Speaker identification uses a pipeline of classic ML models (segmentation, embedding, clustering), not transformers. Even optimized, diarization takes 2x+ longer than transcription.

TTS Models#

Text-to-speech models are fine-tuned LLMs (e.g., Orpheus TTS from Llama 3.2 3B). Same runtime optimizations apply: TensorRT-LLM, FP8 quantization, MIGs.

Key metrics: TTFB (time to first byte), time to first sentence, TPS (80-100 tok/s needed for real-time audio). Beyond real-time speed, scale throughput via concurrent streams.
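Why 80-100 tok/s suffices: the decode rate just has to outpace the rate at which the audio codec consumes tokens during playback. The codec rate below is an illustrative assumption, not a specific codec's:

```python
# Real-time check for a TTS stream (illustrative numbers).
CODEC_TOKENS_PER_AUDIO_SECOND = 85   # hypothetical audio-codec token rate
generation_tps = 100                 # model decode speed

stream_rtf = generation_tps / CODEC_TOKENS_PER_AUDIO_SECOND
is_realtime = stream_rtf >= 1.0      # tokens arrive faster than they play
```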

Streaming over WebSockets is again the biggest infrastructure unlock.

Speech-to-Speech Models

Speech-to-speech models unify ASR + LLM + TTS into a single model. As of publication, no commercially viable open options exist, but research is active; most production voice systems still use the cascading multi-model pipeline.

Image Generation Models#

Working with image/video generation is entirely different from LLMs:

  • Architecture: Iterative denoisers, not autoregressive generators
  • Tooling: PyTorch, TensorRT, and Diffusers (SGLang Diffusion and vLLM Omni are brand new)
  • Optimization: CUDA kernels and custom pipelines rather than inference engine configs

Kernel Optimization

Image generation benefits from:

  • Fused attention kernels (SageAttention for quality-preserving FP8 attention)
  • Quantization (FP8 for denoiser weights and activations)
  • Compilation (torch.compile for the denoising loop)

One Weird Trick

Run the text encoder and VAE concurrently with the denoiser on separate hardware. The text encoder runs once and the VAE runs once, while the denoiser runs 30-50 steps — overlapping these saves meaningful time.
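A toy sketch of the overlap, assuming a steady stream of requests: encode the next request's prompt while the denoiser works on the current one. `encode_prompt` and `denoise` are stand-ins for models that would run on separate hardware:

```python
# Pipelining the once-per-request text encoder against the 30-50 step denoiser.
from concurrent.futures import ThreadPoolExecutor

def encode_prompt(prompt):                 # runs once per request
    return f"emb({prompt})"

def denoise(latent, cond, steps=30):       # runs 30-50 iterations
    for _ in range(steps):
        latent += 1                        # stand-in for a denoising step
    return latent

with ThreadPoolExecutor(max_workers=1) as pool:
    future_cond = pool.submit(encode_prompt, "a red fox")  # next request
    latent = denoise(0, cond="emb(previous prompt)")       # current request
    next_cond = future_cond.result()       # ready by the time we need it
```

The same pattern applies to the VAE decode at the end of the pipeline: it can overlap with the first denoising steps of the following request.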

Video Generation Models#

Video generation extends image generation to 3D latent space (X, Y, Time). Models are 3-5x larger with 10-100x more latent space data.

Key optimizations:

  • Attention optimization: SageAttention and TeaCache for skipping redundant denoising steps
  • Quantization: FP8 for the denoiser
  • Context Parallelism: Replicates weights across GPUs and partitions the attention context — essential for video's massive latent space
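A toy illustration of the partitioning step. Real context parallelism also exchanges keys and values between ranks (e.g., ring-attention style communication), which is omitted here:

```python
# Context parallelism: weights are replicated, the huge latent sequence sharded.
def shard_context(tokens, world_size):
    """Split a token sequence into contiguous, even shards, one per rank."""
    per_rank = len(tokens) // world_size
    return [tokens[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

latents = list(range(8))            # stand-in for video latent tokens
shards = shard_context(latents, 4)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```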

Check Your Understanding


How many visual tokens does a high-resolution image add to a VLM input?