Modalities: Beyond Text
4 sections · Quick reference card
Modality Pipelines
| Modality | Input Format | Key Stage | Output Format |
|---|---|---|---|
| VLM | Image + text tokens | Vision encoder → projector → LLM | Text tokens |
| ASR | Audio waveform (float32) | Encoder → CTC or attention decoder | Text transcript |
| TTS | Text tokens | Acoustic model → vocoder | Audio waveform / MP3 |
| Image Gen | Text prompt + noise latent | Diffusion UNet × N steps | Decoded image (JPEG/PNG) |
| Video Gen | Text/image prompt + noise | Spatiotemporal transformer × N steps | Frame sequence / MP4 |
| Embeddings | Text / image / audio | Encoder forward pass only | Float vector (dense) |
Vision Language Models
- Vision encoder
- ViT (Vision Transformer) encodes the image into patch embeddings. Typical: a 336×336 px image with 14×14 patches yields 24×24 = 576 tokens.
- Projector / adapter
- MLP or cross-attention layer that maps vision embeddings into the LLM's token embedding space.
- Token budget
- Each image adds hundreds of tokens to the context. High-res tiling multiplies the cost further.
- Dynamic resolution
- Split high-resolution images into tiles chosen by size and aspect ratio, encoding each tile (plus a low-res overview) separately. Used in InternVL, LLaVA-Next.
Diffusion Inference Levers
- Inference steps
- Number of denoising iterations. Fewer steps = faster but lower quality. Range: 4 (LCM) to 50 (standard).
- CFG scale
- Classifier-Free Guidance scale. Doubles compute: two forward passes per step (conditional + unconditional). Typical range: 1–15.
- Latent space
- Diffusion runs in a compressed latent space (e.g., a 64×64 latent for a 512×512 image, an 8× VAE downsampling). The VAE decodes back to pixels at the end.
- Distilled models
- Student models distilled to match the teacher's output in far fewer steps (often 1–4). Examples: SDXL-Turbo, FLUX-schnell.
Multimodal Serving Checklist
- Pre-encode images offline when possible — vision encoder is deterministic
- Cache vision embeddings by image hash
- Account for variable token counts in batching logic
- Use dynamic resolution only when image detail matters
- For diffusion: profile step count vs quality tradeoff on your prompts
- Run the VAE decoder on a separate CUDA stream so it overlaps with the next request's denoising
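The embedding-cache item above can be sketched as follows. `encode_image` is a hypothetical callable standing in for the vision encoder; because the encoder is deterministic, a content hash of the image bytes is a safe cache key (keyed per model/resolution in a real system).

```python
import hashlib

# In-process cache of vision embeddings keyed by image content hash.
_embed_cache: dict[str, list[float]] = {}

def embed_with_cache(image_bytes: bytes, encode_image) -> list[float]:
    """Run the vision encoder only on cache miss."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _embed_cache:
        _embed_cache[key] = encode_image(image_bytes)
    return _embed_cache[key]
```

In a multi-turn chat where the same image is re-sent each turn, every turn after the first skips the encoder entirely.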