
Modalities: Beyond Text

4 sections · Quick reference card

Modality Pipelines

| Modality | Input Format | Key Stage | Output Format |
|---|---|---|---|
| VLM | Image + text tokens | Vision encoder → projector → LLM | Text tokens |
| ASR | Audio waveform (float32) | Encoder → CTC or attention decoder | Text transcript |
| TTS | Text tokens | Acoustic model → vocoder | Audio waveform / MP3 |
| Image Gen | Text prompt + noise latent | Diffusion UNet × N steps | Decoded image (JPEG/PNG) |
| Video Gen | Text/image prompt + noise | Spatiotemporal transformer × N steps | Frame sequence / MP4 |
| Embeddings | Text / image / audio | Encoder forward pass only | Float vector (dense) |

Vision Language Models

Vision encoder
ViT (Vision Transformer) encodes the image into patch embeddings. Typical: 336×336 px at patch size 14 → 576 tokens.
Projector / adapter
MLP or cross-attention layer maps vision embeddings into LLM token space.
Token budget
Each image adds hundreds of tokens to the context. High-res tiling multiplies the cost further.
Dynamic resolution
Encode the image at multiple resolutions and select based on content. Used in InternVL and LLaVA-Next.
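The token budget above follows directly from the patch size; a minimal sketch (patch size 14 and the 5×-cost tiling layout are assumptions, matching CLIP-style ViT-L/14 encoders and LLaVA-Next-style AnyRes):

```python
def image_token_count(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens a ViT emits for an image (CLS token excluded)."""
    return (width // patch) * (height // patch)

# 336x336 at patch size 14 -> 24x24 = 576 tokens per image
base = image_token_count(336, 336)

# High-res tiling: a 2x2 grid of 336px tiles plus one global thumbnail
# multiplies the per-image cost roughly 5x.
tiled = 4 * base + base
```

This is why batching logic must treat per-request token counts as variable: two requests with one image each can differ by thousands of tokens.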

Diffusion Inference Levers

Inference steps
Number of denoising iterations. Fewer steps = faster but lower quality. Range: 4 (LCM) to 50 (standard).
CFG scale
Classifier-Free Guidance scale. Doubles compute (two forward passes per step: conditional and unconditional). Range: 1–15.
Latent space
Diffusion runs in compressed latent space (e.g., 64×64 for a 512×512 image). VAE decodes at the end.
Distilled models
Smaller student models trained to match the full model's output in fewer steps. Examples: SDXL-Turbo, FLUX-schnell.
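The two main levers multiply: each denoising step is one UNet forward pass, and CFG > 1 doubles it. A toy cost estimator (the function name and the specific step/CFG pairings are illustrative, not from any library):

```python
def diffusion_forward_passes(steps: int, cfg_scale: float) -> int:
    """UNet forward passes for one image. CFG > 1 requires a conditional and an
    unconditional pass per step; CFG == 1 skips the unconditional pass."""
    per_step = 2 if cfg_scale > 1.0 else 1
    return steps * per_step

# Standard run: 50 steps at CFG 7 -> 100 passes.
standard = diffusion_forward_passes(50, 7.0)
# Distilled (SDXL-Turbo-style) run: 4 steps at CFG 1 -> 4 passes, 25x fewer.
distilled = diffusion_forward_passes(4, 1.0)
```

This is why profiling step count against quality on your own prompts (last checklist item below the diffusion section) usually pays off more than any other diffusion optimization.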

Multimodal Serving Checklist

  • Pre-encode images offline when possible — vision encoder is deterministic
  • Cache vision embeddings by image hash
  • Account for variable token counts in batching logic
  • Use dynamic resolution only when image detail matters
  • For diffusion: profile step count vs quality tradeoff on your prompts
  • Run the VAE decoder on a separate stream to overlap it with the next request