Techniques: Optimization Deep Dives
4 sections · Quick reference card
Quantization Formats
| Format | Bits | VRAM vs FP16 | Typical Quality Loss | Notes |
|---|---|---|---|---|
| FP32 | 32 | 2× | None (baseline) | Training only |
| FP16 / BF16 | 16 | 1× | None | Standard inference |
| FP8 | 8 | 0.5× | < 1% | H100/H200 native, NVIDIA Hopper+ |
| INT8 (W8A8) | 8 | 0.5× | < 1% | SmoothQuant, LLM.int8() |
| INT4 (W4A16) | 4 | 0.25× | 1–3% | GPTQ, AWQ, common for LLMs |
| INT4 (W4A8) | 4/8 | 0.25× | 1–3% | Higher throughput variant |
| 2-bit | 2 | 0.125× | 5–10%+ | QuIP#, AQLM — aggressive |
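The VRAM column above can be turned into a quick sizing estimate. A minimal sketch, counting weight memory only (real quantized checkpoints add some overhead for scales/zero-points, and KV cache and activations are extra):

```python
def weight_vram_gb(n_params: float, bits: int) -> float:
    """Approximate VRAM for model weights alone, in GiB."""
    return n_params * bits / 8 / 1024**3

# Illustrative: a 70B-parameter model at the precisions in the table
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_vram_gb(70e9, bits):.1f} GB")
```

This is why INT4 is the common choice for fitting large LLMs on single consumer or workstation GPUs: the weight footprint drops to a quarter of FP16.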
KV Cache Formulas
KV cache per token (bytes): 2 × n_layers × (n_kv_heads × d_head) × dtype_bytes
Total KV cache (bytes): kv_per_token × max_seq_len × batch_size
KV cache for Llama-3 70B (FP16): 2 × 80 × (8 × 128) × 2 = 327,680 bytes/token ≈ 320 KB/token
KV cache compression (GQA ratio): kv_size_reduction = n_kv_heads / n_query_heads
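The formulas above translate directly into code. A minimal sketch, using the Llama-3 70B configuration from the worked example (80 layers, 8 KV heads via GQA, head dim 128, FP16):

```python
def kv_cache_per_token_bytes(n_layers: int, n_kv_heads: int,
                             d_head: int, dtype_bytes: int = 2) -> int:
    # Leading factor of 2 covers the separate K and V tensors
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

def kv_cache_total_bytes(kv_per_token: int, max_seq_len: int,
                         batch_size: int) -> int:
    return kv_per_token * max_seq_len * batch_size

per_tok = kv_cache_per_token_bytes(80, 8, 128, 2)
print(per_tok)  # 327680 bytes/token, i.e. 320 KB/token
print(kv_cache_total_bytes(per_tok, max_seq_len=8192, batch_size=1) / 1024**3)
```

At an 8K context and batch size 1, that works out to about 2.5 GiB of KV cache on top of the weights; GQA (8 KV heads vs. 64 query heads) is what keeps it this small.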
Parallelism Types
| Type | What's Split | When to Use | Comm Overhead |
|---|---|---|---|
| Tensor (TP) | Weight matrices across GPUs | Model too large for 1 GPU | High (all-reduce each layer) |
| Pipeline (PP) | Layers across GPUs | Very deep models, large batch | Medium (bubble overhead) |
| Data (DP) | Batch across GPU replicas | High throughput, model fits 1 GPU | Low (gradient sync only) |
| Sequence (SP) | Sequence length across GPUs | Very long context | Medium (ring attention) |
| Expert (EP) | MoE experts across GPUs | Large MoE models | Medium (all-to-all routing) |
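A rough sizing rule follows from the table: TP and PP both shard the weights, while DP replicates them. A minimal sketch of per-GPU weight memory under a chosen layout (weight memory only; optimizer state, activations, and KV cache are extra):

```python
def weights_per_gpu_gb(n_params: float, bytes_per_param: int,
                       tp: int = 1, pp: int = 1) -> float:
    """Per-GPU weight footprint in GiB. DP degree is omitted
    because data parallelism replicates rather than shards weights."""
    return n_params * bytes_per_param / (tp * pp) / 1024**3

# e.g. a 70B FP16 model with TP=8 on one node
print(f"{weights_per_gpu_gb(70e9, 2, tp=8):.1f} GB/GPU")
```

The all-reduce cost of TP is why it is usually kept within a node (fast NVLink), with PP or DP used across nodes.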
Speculative Decoding
- Draft model: small, fast model generates K candidate tokens speculatively (e.g., 4–8 tokens).
- Target model: large model verifies all K drafts in a single parallel forward pass.
- Acceptance rate (α): fraction of draft tokens accepted. Higher α = better speedup; depends on draft/target similarity.
- Speedup: expected tokens per step = (1 - α^(K+1)) / (1 - α). Up to 2–3× at high acceptance rates.