Cheat Sheets

Quick-reference cards for every chapter — formulas, GPU specs, framework comparisons, and deployment checklists. Built for engineers who already know the concepts and need the numbers fast.

Every sheet is downloadable as a PDF. Pin them next to your terminal during capacity planning, code reviews, and incident triage.

Ch 0 · 4 sections

Inference: Three-Layer Framework

The mental model for thinking about inference systems

Runtime, infrastructure, tooling — the three layers every inference engineer must know. TTFT, TPOT, throughput, and a layer-by-layer responsibility map.

definitions + table + checklist

Ch 1 · 4 sections

Prerequisites: Latency, Throughput & Budgets

The math that turns workloads into hardware budgets

Five core formulas: end-to-end latency, throughput, cost-per-token, Little's Law, and MFU. Size your capacity before provisioning a single GPU.

formulas + table + definitions + checklist
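As a rough illustration of three of the formulas named above, here is a minimal Python sketch. The function names, the latency decomposition (TTFT plus one TPOT per remaining output token), and the sample numbers are assumptions made for this example and may differ from the definitions used on the sheet itself.

```python
def end_to_end_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Assumed decomposition: the first token arrives after TTFT,
    and each remaining output token takes one TPOT."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)


def concurrent_requests(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W (mean requests in flight)."""
    return arrival_rate_rps * mean_latency_s


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost of one GPU-hour spread over the tokens produced in that hour,
    scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Illustrative inputs only, not measurements.
    latency = end_to_end_latency_s(ttft_s=0.5, tpot_s=0.05, output_tokens=101)
    print(f"latency: {latency:.2f} s")
    print(f"in flight: {concurrent_requests(10, latency):.1f} requests")
    print(f"cost: ${cost_per_million_tokens(3.6, 1000):.2f} per 1M tokens")
```

Feeding the latency estimate into Little's Law gives the concurrency a deployment has to sustain, which is usually the number that drives batch size and replica count.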

Ch 2 · 4 sections

Models: Transformer Architecture

Attention, parameter counts, and KV cache math

Scaled dot-product attention, scaling laws, VRAM-for-weights, KV cache per token, and arithmetic intensity — the numbers every model deployment needs.

formulas + table + definitions
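The VRAM-for-weights and KV-cache-per-token figures mentioned above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: dense weights stored at a uniform precision, a standard key/value cache with one K and one V vector per layer per KV head, and hypothetical model dimensions chosen for illustration rather than taken from any specific model.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone: parameter count times bytes per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: float) -> float:
    """KV cache for one token: a K and a V vector (factor of 2)
    for every layer and every KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 7B-parameter model held in FP16 (2 bytes per value).
    print(f"weights: {weight_bytes(7e9, 2) / 1e9:.1f} GB")
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32,
                                         head_dim=128, bytes_per_elem=2)
    print(f"KV cache: {per_token / 1024:.0f} KiB per token")
    # Cache for one 4096-token sequence under the same assumptions.
    print(f"KV cache at 4096 tokens: {per_token * 4096 / 2**30:.1f} GiB")
```

Models that use grouped-query attention have fewer KV heads than attention heads, so `num_kv_heads` is the value to plug in, not the query head count.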

Ch 3 · 4 sections

Hardware: GPUs & Accelerators

GPU specs side-by-side: H100, H200, B200, A100, L4 and more

Memory, bandwidth, FP16 TFLOPS, and TDP for every datacenter GPU you'd deploy. The reference table for matching hardware to your workload.

table + definitions + checklist

Ch 4 · 4 sections

Software: Stack & Inference Engines

vLLM, SGLang, TensorRT-LLM and the layers beneath them

The inference software stack from application to CUDA runtime. Side-by-side comparison of vLLM, SGLang, and TensorRT-LLM — and what each layer owns.

table + definitions + checklist

Ch 5 · 4 sections

Techniques: Optimization Deep Dives

Quantization, speculative decoding, parallelism, and KV cache tricks

FP8, INT8, INT4 quantization formats with quality tradeoffs. Speculative decoding, tensor/pipeline parallelism, and KV cache compression at a glance.

table + formulas + definitions

Ch 6 · 4 sections

Modalities: Beyond Text

VLMs, ASR, TTS, image and video generation pipelines

How vision-language, speech, and image/video models differ from text LLMs. Pipeline stages, I/O formats, and where inference bottlenecks actually occur.

table + definitions + checklist

Ch 7 · 4 sections

Production: Autoscaling & Deployment

Replica sizing, scale-up triggers, and cooldown defaults that work

Formulas for minimum replicas, GPU count for target throughput, and queue-depth scale-up. Includes cooldown defaults that prevent autoscaler thrashing.

formulas + checklist + table + definitions
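To give a sense of what replica-sizing formulas of this kind look like, here is a hedged Python sketch. The exact formulas on the sheet are not reproduced here; the headroom factor, the proportional queue-depth rule (modeled on the common "scale in proportion to metric over target" approach), and all parameter names are assumptions for illustration.

```python
import math


def min_replicas(peak_rps: float, per_replica_rps: float,
                 target_utilization: float = 0.7) -> int:
    """Replicas needed to serve peak load while keeping each replica
    at or below an assumed target utilization."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))


def gpus_for_throughput(target_tokens_per_s: float,
                        tokens_per_s_per_gpu: float) -> int:
    """GPU count needed to reach a target aggregate token throughput."""
    return math.ceil(target_tokens_per_s / tokens_per_s_per_gpu)


def desired_replicas_from_queue(current_replicas: int, queue_depth: float,
                                target_queue_depth: float) -> int:
    """Proportional rule: scale replicas by the ratio of the observed
    queue depth to the target queue depth, never dropping below one."""
    scaled = current_replicas * queue_depth / target_queue_depth
    return max(1, math.ceil(scaled))


if __name__ == "__main__":
    # Illustrative inputs only.
    print(min_replicas(peak_rps=100, per_replica_rps=12))
    print(gpus_for_throughput(50_000, 2_400))
    print(desired_replicas_from_queue(4, queue_depth=30, target_queue_depth=10))
```

A cooldown window would sit on top of `desired_replicas_from_queue`, holding the last decision for a fixed period so that short queue spikes do not cause repeated scale-up and scale-down cycles.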