Cheat Sheets
Quick-reference cards for every chapter: key formulas, GPU specs, configuration tables, and deployment checklists. Download any sheet as a PDF.
Inference: Three-Layer Framework
The Three Layers · Key Terms · Runtime / Infra / Tooling Split · Inference Lifecycle Steps
Prerequisites: Latency, Throughput & Budgets
Core Formulas · Latency Budget Targets · Key Definitions · Model Selection Checklist
Models: Transformer Architecture
Core Formulas · Prefill vs. Decode · Transformer Block Components · MoE vs. Dense
Hardware: GPUs & Accelerators
GPU Specs Reference · Memory Hierarchy · Key Hardware Concepts · Sizing Checklist
Software: Stack & Inference Engines
Software Stack Layers · Inference Engine Comparison · Key Concepts · Model Format Checklist
Techniques: Optimization Deep Dives
Quantization Formats · KV Cache Formulas · Parallelism Types · Speculative Decoding
Modalities: Beyond Text
Modality Pipelines · Vision Language Models · Diffusion Inference Levers · Multimodal Serving Checklist
Production: Autoscaling & Deployment
Autoscaling Formulas · Cold Start Mitigation Checklist · Batching Strategies · Observability Key Metrics