Cheat Sheets
Quick-reference cards for every chapter — formulas, GPU specs, framework comparisons, and deployment checklists. Built for engineers who already know the concepts and need the numbers fast.
Every sheet is downloadable as a PDF. Pin them next to your terminal during capacity planning, code reviews, and incident triage.
Inference: Three-Layer Framework
The mental model for thinking about inference systems
Runtime, infrastructure, and tooling: the three layers every inference engineer must know. Covers time to first token (TTFT), time per output token (TPOT), throughput, and a layer-by-layer responsibility map.
Prerequisites: Latency, Throughput & Budgets
The math that turns workloads into hardware budgets
Five core formulas: end-to-end latency, throughput, cost-per-token, Little's Law, and model FLOPs utilization (MFU). Size your capacity before provisioning a single GPU.
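As a rough illustration of three of these formulas, here is a minimal Python sketch. It uses common textbook forms, which may differ in detail from the sheet itself. The function names and the "TTFT plus one TPOT per remaining token" latency approximation are assumptions made for this example.

```python
def end_to_end_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate request latency (assumed form): the first token costs TTFT,
    and each remaining output token costs one TPOT."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)


def requests_in_flight(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of concurrent requests."""
    return arrival_rate_rps * mean_latency_s


def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical workload numbers, chosen only for illustration.
    latency = end_to_end_latency_s(ttft_s=0.5, tpot_s=0.02, output_tokens=101)
    print("latency (s):", latency)
    print("requests in flight:", requests_in_flight(10, latency))
    print("cost per 1M tokens:", cost_per_million_tokens(3.60, 1000))
```

The Little's Law result gives the concurrency a deployment has to sustain, which is the number to compare against the batch capacity of each replica.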
Models: Transformer Architecture
Attention, parameter counts, and KV cache math
Scaled dot-product attention, scaling laws, VRAM-for-weights, KV cache per token, and arithmetic intensity — the numbers every model deployment needs.
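The two memory formulas can be sketched in a few lines of Python. This is a minimal sketch assuming a standard decoder-only transformer that stores one key and one value vector per layer per KV head. The function names and the 7B-style configuration in the example are illustrative assumptions, not values taken from the sheet.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """VRAM needed for the weights alone: parameter count times precision width."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """KV cache footprint of one token: a key and a value vector (the factor 2)
    for every layer and every KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical 7B-parameter model in FP16 (2 bytes per value):
    # 32 layers, 32 KV heads, head dimension 128.
    print("weights (GB):", weight_bytes(7_000_000_000, 2) / 1e9)
    print("KV cache per token (KiB):", kv_cache_bytes_per_token(32, 32, 128, 2) / 1024)
```

Multiplying the per-token figure by context length and batch size gives the KV cache budget that has to fit alongside the weights.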
Hardware: GPUs & Accelerators
GPU specs side-by-side: H100, H200, B200, A100, L4 and more
Memory, bandwidth, FP16 TFLOPS, and TDP for every datacenter GPU you'd deploy. The reference table for matching hardware to your workload.
Software: Stack & Inference Engines
vLLM, SGLang, TensorRT-LLM and the layers beneath them
The inference software stack from application to CUDA runtime. Side-by-side comparison of vLLM, SGLang, and TensorRT-LLM — and what each layer owns.
Techniques: Optimization Deep Dives
Quantization, speculative decoding, parallelism, and KV cache tricks
FP8, INT8, INT4 quantization formats with quality tradeoffs. Speculative decoding, tensor/pipeline parallelism, and KV cache compression at a glance.
Modalities: Beyond Text
VLMs, ASR, TTS, image and video generation pipelines
How vision-language, speech, and image/video models differ from text LLMs. Pipeline stages, I/O formats, and where inference bottlenecks actually occur.
Production: Autoscaling & Deployment
Replica sizing, scale-up triggers, and cooldown defaults that work
Formulas for minimum replicas, GPU count for target throughput, and queue-depth scale-up. Includes cooldown defaults that prevent autoscaler thrashing.
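One common way to write the replica and GPU-count formulas is sketched below in Python. The headroom factor `target_utilization`, its 0.8 default, and the function names are assumptions made for this example and may not match the defaults on the sheet.

```python
import math


def min_replicas(
    target_tokens_per_s: float,
    per_replica_tokens_per_s: float,
    target_utilization: float = 0.8,  # assumed headroom, not a sheet default
) -> int:
    """Smallest replica count that serves the target throughput while keeping
    each replica at or below the target utilization."""
    if per_replica_tokens_per_s <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    usable = per_replica_tokens_per_s * target_utilization
    return max(1, math.ceil(target_tokens_per_s / usable))


def total_gpus(replicas: int, gpus_per_replica: int) -> int:
    """GPU count for the fleet: replicas times GPUs per replica
    (for example, the tensor-parallel degree)."""
    return replicas * gpus_per_replica


if __name__ == "__main__":
    # Hypothetical numbers: 1000 tok/s target, 300 tok/s per replica, 2 GPUs each.
    replicas = min_replicas(1000, 300)
    print("replicas:", replicas, "GPUs:", total_gpus(replicas, 2))
```

Keeping utilization below 1.0 leaves room for bursts, so the autoscaler does not have to react to every small spike.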