# Hardware: GPUs & Accelerators
## GPU Specs Reference
| GPU | Architecture | Memory (GB) | Mem BW (TB/s) | FP16 TFLOPS (dense) | TDP (W) |
|---|---|---|---|---|---|
| H100 SXM5 | Hopper | 80 | 3.35 | 989 | 700 |
| H200 SXM5 | Hopper | 141 | 4.8 | 989 | 700 |
| B200 SXM6 | Blackwell | 192 | 8.0 | 2250 | 1000 |
| B300 SXM6 | Blackwell Ultra | 288 | 8.0 | 2500 | 1000 |
| A100 SXM4 | Ampere | 80 | 2.0 | 312 | 400 |
| L4 | Ada Lovelace | 24 (GDDR6) | 0.3 | 121 | 72 |
## Memory Hierarchy
| Level | Capacity | Bandwidth | Latency |
|---|---|---|---|
| Registers | ~256 KB/SM | ~100 TB/s | ~1 cycle |
| L1 / Shared Mem | 128–256 KB/SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 50–96 MB | ~7 TB/s | ~200 cycles |
| HBM (VRAM) | 80–288 GB | 3–8 TB/s | ~600 cycles |
| NVLink | multi-GPU | 0.9 TB/s | ~1 μs |
| PCIe 5.0 | CPU-GPU | 0.128 TB/s | ~5 μs |
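One practical use of the HBM row above: for a memory-bound decode step, every weight byte must stream from HBM once per generated token, so model size ÷ HBM bandwidth gives a lower bound on per-token latency. A back-of-envelope sketch (ignores KV-cache reads and assumes weights do not fit in cache; the 70B/H100 numbers are illustrative):

```python
# Lower bound on per-token decode latency for a memory-bandwidth-bound LLM.
def min_token_latency_s(model_bytes: float, hbm_bw_bytes_per_s: float) -> float:
    """Time to stream all weights from HBM once (one decode step)."""
    return model_bytes / hbm_bw_bytes_per_s

# 70B params in FP16 (2 bytes each) on an H100 (3.35 TB/s from the table):
latency = min_token_latency_s(70e9 * 2, 3.35e12)
print(f"{latency * 1e3:.1f} ms/token")  # ~41.8 ms -> at most ~24 tokens/s per GPU
```

This is why bandwidth, not peak FLOPS, usually dictates single-stream decode throughput.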
## Key Hardware Concepts
- **Arithmetic intensity (AI)**: FLOPs per byte moved. If AI is below the GPU's ops:byte ratio, the workload is memory-bandwidth-bound.
- **Roofline model**: attainable performance = min(peak FLOPS, bandwidth × AI). Identifies whether a kernel is compute- or bandwidth-limited.
- **NVLink**: NVIDIA's high-speed GPU interconnect; 900 GB/s per GPU on NVLink 4.0. Used for tensor parallelism.
- **NVSwitch**: all-to-all NVLink switching fabric in DGX/HGX nodes; enables a full-bandwidth GPU mesh.
- **MIG (Multi-Instance GPU)**: partitions one H100 into up to seven isolated GPU instances. Well suited to serving small models.
## Sizing Checklist
- Calculate model VRAM: params × bytes_per_param (FP16=2, INT8=1, FP8=1, INT4=0.5)
- Add KV cache: 2 × layers × kv_heads × d_head × context_len × batch × dtype_bytes (2× covers K and V; use KV heads, not query heads, for GQA models)
- Add activation memory: ~1 GB overhead per GPU
- Check NVLink topology if using tensor parallelism
- Verify PCIe bandwidth for CPU-GPU data transfers
- Consider MIG for small models to reduce cost
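The first three checklist steps can be sketched as a small calculator. The model shape below (70B params, 80 layers, 8 KV heads, d_head 128) is illustrative, not a specific product:

```python
# VRAM sizing sketch following the checklist above.
def model_vram_gb(params: float, bytes_per_param: float) -> float:
    """Weight memory: params x bytes_per_param (FP16=2, INT8=1, FP8=1, INT4=0.5)."""
    return params * bytes_per_param / 1e9

def kv_cache_gb(layers, kv_heads, d_head, context_len, batch, dtype_bytes):
    """KV cache: 2x (K and V) per layer; kv_heads handles GQA."""
    return 2 * layers * kv_heads * d_head * context_len * batch * dtype_bytes / 1e9

# Hypothetical 70B model in FP16, 8k context, batch 4:
weights = model_vram_gb(70e9, 2)           # 140.0 GB
kv = kv_cache_gb(80, 8, 128, 8192, 4, 2)   # ~10.7 GB
total = weights + kv + 1.0                 # + ~1 GB activation overhead
print(round(total, 1))                     # ~151.7 GB -> needs >1 GPU, or quantization
```

At 151.7 GB this model overflows a single H100 (80 GB) but fits on one B200 (192 GB), or on two H100s with tensor parallelism (hence the NVLink-topology check in the list).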