Software: Stack & Inference Engines
4 sections · Quick reference card
Software Stack Layers
| Layer | What It Does | Examples |
|---|---|---|
| Application | API, routing, auth, caching | FastAPI, LiteLLM, nginx |
| Inference Engine | Batching, KV cache mgmt, scheduling | vLLM, SGLang, TRT-LLM |
| DL Framework | Tensor ops, autograd, model loading | PyTorch, JAX |
| Kernel Libraries | Optimized GPU ops | cuDNN, CUTLASS, FlashAttention |
| GPU Runtime | Device management, memory, streams | CUDA, ROCm, OpenCL |
| Driver / Hardware | Physical GPU execution | NVIDIA Driver, H100, A100 |
Inference Engine Comparison
| Engine | Strengths | Best For |
|---|---|---|
| vLLM | PagedAttention, OpenAI-compatible API, broad model support | General serving, research |
| SGLang | RadixAttention, structured generation, low latency | Agentic workloads, constrained gen |
| TensorRT-LLM | NVIDIA-optimized kernels, highest throughput | Production on NVIDIA hardware |
| llama.cpp | CPU+GPU, GGUF quantization, low memory | Edge, local, consumer hardware |
| MLC LLM | Multi-platform (CUDA/Metal/WebGPU) | Cross-platform deployment |
Key Concepts
- PagedAttention
- KV cache stored in non-contiguous memory pages (like OS virtual memory). Eliminates fragmentation. Enables high batch sizes. Core of vLLM.
- Continuous batching
- New requests inserted into batch as soon as a slot frees. Eliminates padding waste. Also called iteration-level scheduling.
- CUDA graphs
- Capture a GPU kernel sequence as a graph, replay it with minimal CPU launch overhead. Reduces latency for static (fixed batch/sequence) shapes.
- FlashAttention
- IO-aware attention kernel. Fuses softmax + matmul. Avoids materializing N×N attention matrix. 2-4× faster, 5-20× less memory.
- GGUF
- File format for quantized models (successor to GGML). Used by llama.cpp. Stores model weights + metadata in one file.
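The PagedAttention idea above can be sketched with a toy page allocator. This is an illustration only, not the vLLM API: the `PagedKVCache` class, its method names, and the 4-token block size are all invented for the example (vLLM uses larger blocks, e.g. 16 tokens).

```python
BLOCK_SIZE = 4  # tokens per KV page; hypothetical, real engines use larger blocks

class PagedKVCache:
    """Toy sketch: KV slots live in fixed-size pages drawn from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free page pool
        self.tables = {}                      # seq_id -> list of page ids (block table)
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve a KV slot for one new token, paging in a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current page full, or first token
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        """Return a finished sequence's pages to the pool for immediate reuse."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=8)
for _ in range(6):          # sequence A: 6 tokens -> occupies 2 pages
    cache.append_token("A")
for _ in range(3):          # sequence B: 3 tokens -> 1 page
    cache.append_token("B")
cache.free_seq("A")         # A's pages go straight back to the pool
```

Because pages are non-contiguous and pool-allocated, a finished sequence's memory is reusable by any new request, which is what eliminates fragmentation.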
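Continuous batching can likewise be shown with a toy simulation. The `serve` function below is hypothetical (not any engine's scheduler); it assumes a fixed batch capacity and a known remaining-token count per request.

```python
from collections import deque

def serve(requests, capacity):
    """Toy iteration-level scheduler.
    requests: list of (req_id, tokens_to_generate). Returns completion order."""
    queue = deque(requests)
    running = {}            # req_id -> tokens still to generate
    finished = []
    while queue or running:
        # Admit new requests the moment a batch slot frees, rather than
        # waiting for the whole batch to drain (static batching).
        while queue and len(running) < capacity:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode iteration: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
    return finished

order = serve([("a", 2), ("b", 5), ("c", 1)], capacity=2)
```

Note that "c" joins as soon as "a" finishes and completes before the long request "b", even though "b" was admitted first; no slot ever sits idle padding out a batch.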
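The trick that lets FlashAttention avoid the N×N matrix is online (streaming) softmax: process scores in tiles, keeping a running max and rescaling the accumulators when it changes. A pure-Python sketch with scalar values, names invented for illustration:

```python
import math

def online_attention(score_chunks, value_chunks):
    """Streaming softmax-weighted average over chunks of (score, value) pairs.
    Only three scalars of state; no full score row is ever materialized."""
    m, denom, acc = float("-inf"), 0.0, 0.0
    for scores, values in zip(score_chunks, value_chunks):
        m_new = max(m, max(scores))
        c = math.exp(m - m_new)      # rescale old accumulators to the new max
        denom *= c
        acc *= c
        for x, v in zip(scores, values):
            e = math.exp(x - m_new)
            denom += e
            acc += e * v
        m = m_new
    return acc / denom

scores = [1.0, 3.0, 2.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0]

# Reference: ordinary (materialized) softmax-weighted average.
mx = max(scores)
full = (sum(math.exp(x - mx) * v for x, v in zip(scores, values))
        / sum(math.exp(x - mx) for x in scores))

# Tiled computation over two chunks gives the same answer.
tiled = online_attention([scores[:2], scores[2:]], [values[:2], values[2:]])
```

The real kernel applies the same rescaling to output-row accumulators in SRAM tiles; the algebra is identical.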
Model Format Checklist
- Safetensors: prefer over pickle-based .bin (no arbitrary-code-execution risk, mmap-friendly loading)
- FP16/BF16 baseline: most inference engines expect half-precision weights by default
- GGUF: use for llama.cpp and consumer deployment
- TensorRT engine: pre-compile for fixed batch/seq shapes
- Verify config.json has correct rope_scaling for long context