Most inference engineers use cloud GPUs. Large enterprises and governments run on-premise deployments, but cloud-based GPUs offer the flexibility and access that fast-growing AI products need to scale.
Even so, navigating the hardware landscape is complex. From variations among cloud providers to NVIDIA's naming conventions, there are many nuances in selecting the right accelerator.
GPU Architecture
GPUs are throughput machines. Where CPUs are great at complex sequential execution, GPUs are designed for simple, massively parallel workloads. Given that AI model inference is a series of vector and matrix multiplications, GPUs are a natural fit.
GPUs have Streaming Multiprocessors (SMs), and each SM contains multiple cores of three types:
- CUDA Core: Operates on individual numbers (scalars)
- Tensor Core: Operates on vectors and matrices
- Special Function Unit (SFU): Accelerates operations like sin, cos, log
When measuring GPU compute for inference, focus on Tensor Core compute: Tensor Cores execute the Matrix Multiply and Accumulate (MMA) instructions that are foundational to inference.
Compute is measured in FLOPS (floating point operations per second). When reading spec sheets, note two Tensor Core measurements:
- Dense: Raw floating-point operations if every element is used
- Sparse: With 2:4 structured sparsity (two of every four consecutive values are zero), Tensor Cores skip the multiplications by zero, doubling throughput
By default, inference is dense, so ensure you're looking at FLOPS without sparsity. FLOPS generally double with each halving of precision.
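As a sketch of how these spec-sheet numbers relate, here is a rough calculator. The exact 2x scaling factors are the general rule described above, and the 989.4 TFLOPS FP16 dense figure for the H100 SXM is an assumed input; real hardware can deviate from both.

```python
# Rough spec-sheet arithmetic: dense vs. sparse FLOPS across precisions.
# Assumes throughput doubles with each halving of precision and that
# 2:4 sparsity doubles the headline number -- estimates, not guarantees.

def dense_flops(fp16_dense_tflops: float, bits: int) -> float:
    """Estimate dense Tensor Core TFLOPS at a given precision,
    assuming throughput doubles each time precision halves from FP16."""
    return fp16_dense_tflops * (16 / bits)

def sparse_flops(dense_tflops: float) -> float:
    """2:4 structured sparsity lets Tensor Cores skip half the multiplies,
    doubling the number printed on spec sheets."""
    return dense_tflops * 2

# Assumed input: ~989.4 dense FP16 TFLOPS for an H100 SXM.
fp8_dense = dense_flops(989.4, bits=8)   # ~1979 TFLOPS dense FP8
fp8_sparse = sparse_flops(fp8_dense)     # ~3958 TFLOPS, the "with sparsity" figure
```

This is why the same GPU can appear with very different FLOPS numbers in marketing material: precision and sparsity each double the headline figure.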
Key Takeaway
Compute is the bottleneck for LLM prefill and for image/video generation. If selecting hardware for these use cases, pick the accelerator with more FLOPS.
Memory and Caches
GPUs contain high-speed onboard memory called VRAM, implemented on modern datacenter GPUs as high-bandwidth memory (HBM3, HBM3e, or HBM4). GPUs have dozens or hundreds of gigabytes of VRAM.
Any chip draws on two types of memory:
- DRAM (Dynamic RAM): General-purpose off-chip memory (gigabytes) — VRAM is a type of DRAM
- SRAM (Static RAM): Faster, more expensive, on-chip memory (kilobytes/megabytes)
GPUs have three levels of cache:
| Cache | Scope | Example (H100) |
|---|---|---|
| L0 | Instruction cache per SM partition | — |
| L1 | Shared memory per SM | 256 KB per SM |
| L2 | Global cache across all SMs | 50 MB total |
VRAM bandwidth measures the peak transfer rate between GPU cores and VRAM. This determines how quickly VRAM can supply data into the cache hierarchy.
The total VRAM limits the model size you can load. VRAM should hold the model weights plus at least 50 percent headroom for KV cache (more for long context, high batch sizes, or video generation models).
If there isn't enough VRAM for the weights, loading the model will fail with an OOM (out of memory) error. If there isn't enough headroom, inference will be slow or crash.
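The sizing rule above can be sketched as a short calculation. The 50 percent headroom factor comes from the guideline above; the byte counts per parameter (FP16 = 2, FP8 = 1, 4-bit = 0.5) are standard but the function names are illustrative.

```python
# Minimal VRAM sizing sketch: weights plus at least 50% headroom
# for KV cache and activations, per the rule of thumb above.

def vram_needed_gb(params_billions: float, bytes_per_param: float,
                   headroom: float = 0.5) -> float:
    weights_gb = params_billions * bytes_per_param  # 1B params * 1 byte ~= 1 GB
    return weights_gb * (1 + headroom)

def fits(params_billions: float, bytes_per_param: float, vram_gb: float) -> bool:
    return vram_needed_gb(params_billions, bytes_per_param) <= vram_gb

# A 70B model in FP8 needs ~105 GB with headroom: too big for an
# 80 GB H100, fine on a 141 GB H200.
print(fits(70, 1.0, 80))   # False
print(fits(70, 1.0, 141))  # True
```

Long context, high batch sizes, or video generation would push the headroom factor well above 0.5.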
Memory bandwidth is the bottleneck for LLM decode at low to medium batch sizes. If you want more tokens per second, select the accelerator with higher memory bandwidth (e.g., H200 instead of H100).
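A back-of-envelope version of this bottleneck: at low batch sizes, every generated token must stream all model weights from VRAM, so per-sequence decode speed is roughly bandwidth divided by model size. This is an upper bound that ignores KV cache traffic and batching, not a benchmark.

```python
# Memory-bound decode ceiling: tokens/sec ~= VRAM bandwidth / model size.
# Ignores KV cache reads, batching, and kernel efficiency -- an upper bound.

def decode_tokens_per_sec(bandwidth_tb_s: float, model_gb: float) -> float:
    return bandwidth_tb_s * 1000 / model_gb  # TB/s -> GB/s, then divide by GB

# A 70B model quantized to FP8 (~70 GB of weights):
h100 = decode_tokens_per_sec(3.35, 70)  # ~48 tokens/sec ceiling
h200 = decode_tokens_per_sec(4.8, 70)   # ~69 tokens/sec ceiling
```

The ratio between the two ceilings tracks the bandwidth ratio, which is why the H200's extra bandwidth translates directly into decode speed.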
GPU Architecture Generations
Every one to two years, NVIDIA releases a new GPU architecture. GPU names like B200 have two parts:
- Letter: Signifies the architecture generation (e.g., B = Blackwell)
- Number: Identifies models within the generation (bigger = more powerful)
Since 1998, NVIDIA has named architectures for prominent scientists. Inference engineers generally work within the three to five most recent generations.
Hopper GPUs
| GPU | FP8 Compute (dense) | Memory | Bandwidth |
|---|---|---|---|
| H100 | 1,979 teraFLOPS | 80 GB | 3.35 TB/s |
| H200 | 1,979 teraFLOPS | 141 GB | 4.8 TB/s |
Named for Rear Admiral Grace Hopper. First released March 2022. Key features:
- FP8 support: FP8 Tensor Cores are 2x faster than FP16 and halve the memory bandwidth needed per value
- Fourth-generation Tensor Cores with more and faster SMs
- FlashAttention 3 takes advantage of Hopper's asynchronous data transfer
- The most widely used inference accelerators with industry-wide support and highly optimized kernels
Ada Lovelace GPUs
| GPU | FP8 Compute (dense) | Memory | Bandwidth |
|---|---|---|---|
| L4 | 242 teraFLOPS | 24 GB | 300 GB/s |
| L40 | 362 teraFLOPS | 48 GB | 864 GB/s |
Named for the first computer programmer. More graphics-oriented than Hopper. Does not support NVLink interconnect — a major limitation for multi-GPU parallelism.
L4 GPUs are a cheap way to run small models (text embeddings, computer vision). L40 GPUs generally aren't a great choice for inference; MIG slices of larger GPUs often offer better value.
Blackwell GPUs
| GPU | FP8 Compute (dense) | Memory | Bandwidth |
|---|---|---|---|
| B200 | ~5 petaFLOPS | 192 GB | Up to 8 TB/s |
| B300 | ~5 petaFLOPS | 288 GB | Up to 8 TB/s |
Named for mathematician David Blackwell. First released November 2024. Key features:
- FP4 support plus microscaling formats (MXFP8, MXFP4, NVFP4) for better quality retention
- Updated asynchronous programming features for tiling loads and writes
- FlashAttention 4 relies on asynchronous pipelines
- The new gold standard for inference — highest performance for LLMs and video generation
Rubin GPUs
The Rubin architecture (named for astronomer Vera Rubin) will launch in 2026 with:
- HBM4 for higher memory bandwidth — benefits memory-bound tasks like LLM decode
- CPX: A separate chip built for compute-bound tasks like LLM prefill
- Part of NVIDIA's rack-scale systems for high-volume inference
After Rubin, NVIDIA will release Feynman in 2028.
Grace and Vera CPUs
NVIDIA offers ARM-based CPUs integrated with their GPUs on superchips like the GH200 and GB200. Grace CPUs use NVLink-C2C (chip-to-chip) for up to 900 GB/s of bidirectional bandwidth between CPU and GPU memory, several times faster than PCIe.
This matters for inference setups that offload LoRA weights and KV caches from previous calls to CPU memory. For the Rubin architecture, the Vera CPU replaces Grace.
Instances
The atomic unit of GPU allocation on the cloud is an instance — a virtual machine including:
- GPUs (device): One or more GPUs for inference
- CPUs (host): General-purpose compute
- Memory: RAM for CPU operations
- Storage: Disk for loading and storing files
- Networking: Physical network connections
- Interconnect: GPU-to-GPU and node-to-node connections
Key Takeaway
When provisioning instances, understand exactly what you're getting. Any component — not just the GPU — could present a bottleneck or failure point during inference.
Multi-GPU Instances
When a model is too big for a single GPU, or when multiple GPUs improve performance, the standard unit is a node containing eight GPUs connected via:
- NVLink: One-to-one communication between GPUs (up to 1800 GB/s on Blackwell, 900 GB/s on Hopper)
- NVSwitch: All-to-all communication on top of NVLink for coordination among all GPUs in a node
For more than eight GPUs, InfiniBand provides node-to-node interconnect (up to 400 Gb/s per NIC). NVIDIA's GB200 NVL72 system combines 72 Blackwell GPUs and 36 Grace CPUs in a full rack.
When working with multi-GPU systems, keep relative bandwidths in mind — NVLink is an order of magnitude faster than InfiniBand.
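To make that order-of-magnitude gap concrete, here is an illustrative transfer-time comparison using the peak figures quoted above (Hopper NVLink at 900 GB/s; InfiniBand at 400 Gb/s, i.e. 50 GB/s, per NIC). Real transfers see lower effective bandwidth.

```python
# Time to move a tensor over NVLink (within a node) vs. InfiniBand
# (across nodes), using peak bandwidths -- effective rates are lower.

def transfer_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    return size_gb / bandwidth_gb_s * 1000

tensor_gb = 1.0  # e.g., a shard of activations or KV cache
nvlink_ms = transfer_ms(tensor_gb, 900)  # ~1.1 ms within a node
ib_ms = transfer_ms(tensor_gb, 50)       # ~20 ms across nodes, ~18x slower
```

This gap is why parallelism strategies that communicate heavily (like tensor parallelism) are usually kept within a single NVLink-connected node.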
Multi-Instance GPUs
Sometimes the GPU is too big for the model. Multi-instance GPU (MIG) is a hardware-level capability in A100, H100, H200, and B200 GPUs — they can be split into as many as seven pieces, each receiving a slice of compute, memory, CPU, and networking.
For example, an H100 MIG with three slices has about 3/7 of available compute and up to half of total VRAM (40 GB). For small models like Orpheus TTS (3B parameters), two MIG instances can be more efficient than a full GPU.
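As an illustrative sketch of the slice math, assuming the H100 figures above (1,979 dense FP8 TFLOPS, compute divided into sevenths) and treating the compute split as linear:

```python
# MIG slice resources on an H100, approximated from the fractions above.
# NVIDIA profile names follow <compute slices>g.<GiB>gb, e.g. "3g.40gb".

def mig_resources(compute_slices: int, vram_gb: int,
                  total_fp8_tflops: float = 1979) -> dict:
    return {
        "compute_tflops": total_fp8_tflops * compute_slices / 7,
        "vram_gb": vram_gb,
    }

three_slice = mig_resources(3, 40)  # ~848 FP8 TFLOPS, 40 GB: the example above
one_slice = mig_resources(1, 10)    # smallest H100 MIG profile (1g.10gb)
```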
Other Datacenter Accelerators
NVIDIA is far from the only company making inference hardware. Notable alternatives:
| Company | Product | Edge |
|---|---|---|
| AMD | MI350 GPU | Competitive specs on AMD's ROCm software stack |
| AWS | Inferentia / Trainium | Purpose-built chips on AWS |
| Cerebras | WSE-3 | Wafer-scale chip with ultra-high memory bandwidth |
| Etched | Sohu | ASIC for the transformer architecture |
| Google | TPU | AI-specific ASIC for inference and training |
| Groq | LPU | SRAM-based for high memory bandwidth |
| Qualcomm | Cloud AI 100 Ultra | Power-efficient accelerators built on mobile NPU designs |
Every competitor is betting on a specific edge: memory bandwidth (Cerebras, Groq), power efficiency (Furiosa, Qualcomm), or platform integration (Amazon, Google).
Shared challenges: rebuilding the software stack without CUDA, manufacturing complexity, and distribution.
Local Inference
Local inference (also called edge, client-side, or on-device inference) means running models directly on the end user's device.
Advantages: Zero network latency, independence from internet, improved privacy, free for the developer.
Limitations: Weaker hardware capabilities, thermal constraints, fragmented support matrix, battery drain.
Desktop Inference
Increasingly, Apple is the leader in desktop inference. Apple's M-series CPUs and GPUs draw from a single unified memory pool, which avoids copying data between separate CPU and GPU memory.
The M4 Max supports up to 128 GB of unified memory, enough to run quantized models up to 70B parameters, while MacBook Pros and Mac Studios are consumer-priced. Projects like llama.cpp and MLX have optimized LLM inference on Apple Silicon.
Mobile Inference
Mobile inference runs smaller models (1-3B parameters) on smartphones. Apple Intelligence uses on-device models for tasks like Smart Reply and text summarization. Qualcomm and MediaTek chips support inference through their NPUs.
Key Takeaway
Local inference is turning the corner from experimentation to production, with a strong ecosystem across hardware and software and a vibrant community closely affiliated with the world of open models.