AI Inference Hardware Guide

10 min read · Updated 2026-06-01

💡 This is the quick hardware comparison guide

This article maps the hardware landscape and gives a buying framework. For the complete deep-dive on GPU architecture, Tensor Cores, HBM generations, NVLink, and PCIe topology, read Chapter 3: GPU Hardware & Accelerators.

TL;DR

AI inference hardware in 2026 spans NVIDIA Hopper and Blackwell datacenter GPUs, AMD MI300X, Google TPU v5e/v6, AWS Inferentia/Trainium, and Intel Gaudi 3. For LLM inference, memory bandwidth and VRAM capacity dominate the selection decision far more than FLOP count — autoregressive decode re-reads the full model from memory for every token, making bandwidth the binding constraint. NVIDIA H100/H200 remain the production default; AMD MI300X is the best alternative for large-VRAM needs at lower cost; dedicated accelerators (AWS Inferentia, Groq LPU) win only on specific latency-or-cost profiles with stable model workloads.

Key Facts

Best LLM inference bandwidth: B200 SXM: 8.0 TB/s HBM3e (2.4× H100)
Best VRAM capacity (single chip): MI300X: 192 GB HBM3 (vs H100's 80 GB)
Best cost per VRAM GB: AMD MI300X at ~$3–5/hr for 192 GB vs H100 at $3.50–6/hr for 80 GB
TPU v5e advantage: Best price/performance for JAX-based workloads on Google Cloud
AWS Inferentia2 sweet spot: Stable, high-volume Llama/BERT inference at 40–60% lower cost-per-token than H100
NVLink 4.0 vs PCIe bandwidth: 900 GB/s bidirectional NVLink per GPU vs ~64 GB/s per direction over PCIe 5.0 x16; critical for TP≥2

AI inference hardware has proliferated well beyond the simple GPU-or-not decision. In 2026, a production team choosing hardware for LLM serving must evaluate NVIDIA's three active GPU families, AMD's MI300X, Google's TPU lineage, AWS's custom silicon, Intel's Gaudi, and purpose-built low-latency chips like Groq's LPU. Each has genuine strengths, but the decision space is cleaner than it appears once you apply the right evaluation criteria.

The single most important evaluation criterion

Before surveying the hardware, fix the mental model. For LLM inference:

During decode, the GPU re-reads the entire model from memory for every token generated. A 70B FP16 model is 140 GB. Each output token requires reading all 140 GB from VRAM through the memory subsystem. If you generate 10 tokens per second from a single user's request, you're reading 1.4 TB/s of weight data per user. The compute units (Tensor Cores, matrix engines, TPU systolic arrays) are almost entirely idle — bounded at ~1 FLOP per byte read.

This means for decode-heavy workloads (chat, agents, code completion):

Memory bandwidth determines tokens/second per chip
VRAM capacity determines the largest model you can serve
FLOP count is nearly irrelevant for single-user latency (it matters for prefill and for large batches, where arithmetic intensity grows with batch size until it reaches the chip's compute ceiling)

Any hardware selection that ranks chips by FLOPS alone is wrong for inference. Rank by bandwidth/$ and VRAM/$ first.
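To make the bandwidth argument concrete, here is a minimal sketch that computes an idealized ceiling on batch-1 decode speed: peak memory bandwidth divided by the bytes of weights read per token. The function name `decode_ceiling_tok_s` is illustrative, the bandwidth figures are the ones quoted in this guide, and the model deliberately ignores KV-cache reads, kernel overheads, and capacity limits, so real throughput will be lower.

```python
# Idealized decode ceiling: every generated token re-reads all weights,
# so tokens/s cannot exceed memory bandwidth / model size in bytes.

BANDWIDTH_TBPS = {  # peak memory bandwidth figures quoted in this guide
    "H100 SXM": 3.35,
    "H200 SXM": 4.8,
    "MI300X": 5.3,
    "B200 SXM": 8.0,
}


def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float,
                         bandwidth_tbps: float) -> float:
    """Upper bound on batch-1 tokens/s if only weight reads limited decode."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tbps * 1e12 / model_bytes


if __name__ == "__main__":
    # 70B parameters at FP8 (1 byte per parameter) = 70 GB of weights.
    for chip, bandwidth in BANDWIDTH_TBPS.items():
        ceiling = decode_ceiling_tok_s(70, 1, bandwidth)
        print(f"{chip:9s} 70B FP8 ceiling: {ceiling:6.1f} tok/s")
```

Under these assumptions an H100 tops out near 48 tokens/s for a 70B FP8 model and an H200 near 69, which is the same 43% gap as their bandwidth figures. FLOPS never enter the calculation.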

NVIDIA Hopper: H100 and H200

The H100 SXM launched in 2022 and remains the most-deployed LLM inference chip in 2026. Its defining characteristics for inference:

H100 SXM 80G

80 GB HBM3: holds 7B–70B FP8 weights on a single chip, though 70B FP8 (70 GB) leaves only ~10 GB for KV cache; 70B FP16 requires 2× H100
3.35 TB/s bandwidth — the baseline everyone benchmarks against
989 TFLOPS FP16 / 1,979 TFLOPS FP8 — FP8 Tensor Cores (new in Hopper) run 2× FP16 rate
NVLink 4.0 — 900 GB/s bidirectional between GPUs in SXM form factor; critical for TP≥2
Transformer Engine (on 4th-gen Tensor Cores): automatic FP8/FP16 mixed precision based on per-layer dynamic scaling
~$3.50–6.00/hr on major cloud providers (Lambda, CoreWeave, GCP, AWS)

The H100 is the safe default. The software ecosystem (vLLM, SGLang, TensorRT-LLM, NeMo, PyTorch, JAX) is most mature against H100. Any optimization technique you read about was probably benchmarked on H100.

H200 SXM 141G

The H200 is the H100 with upgraded memory: same GPU die, same Tensor Cores, same compute TFLOPS — but with HBM3e memory expanding VRAM from 80 GB to 141 GB and bandwidth from 3.35 TB/s to 4.8 TB/s (+43%).

For inference, the implications are concrete:

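70B FP16 weights fit in a single H200 (140 GB of 141 GB, barely) vs requiring 2× H100, but that leaves about 1 GB for KV cache and activations, so plan for a second GPU or FP8 in production
Up to 43% more decode tokens/second for bandwidth-bound workloads (which is most of them)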
Same prefill speed as H100 — same FLOPS

If you're currently running 2× H100 for a 70B FP16 model and can move to FP8 (70 GB), you fit on 1× H100. If you need FP16 precision and 70B scale, H200 is the right chip. At ~$5–8/hr vs H100's $3.50–6/hr, the per-token cost often favors H200 once you account for needing 2× H100 for FP16 70B.
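The per-token comparison can be sketched with simple arithmetic. The example below is a rough model, not a benchmark: it assumes batch-1 decode limited purely by weight reads, perfect tensor-parallel scaling on the H100 pair, no allowance for KV-cache headroom, and hourly prices at the midpoint of the ranges quoted in this guide. The helper names are hypothetical.

```python
def ideal_tok_s(weight_gb_per_gpu: float, bandwidth_tbps: float) -> float:
    """Batch-1 decode ceiling: bandwidth over the bytes each GPU reads per token."""
    return bandwidth_tbps * 1e12 / (weight_gb_per_gpu * 1e9)


def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per 1M generated tokens at a sustained rate."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1e6


# 70B FP16 = 140 GB of weights.
# 2x H100 with tensor parallelism: each GPU reads half the weights (70 GB).
h100_pair = cost_per_million_tokens(2 * 4.75, ideal_tok_s(70, 3.35))
# 1x H200: a single GPU reads all 140 GB.
h200_single = cost_per_million_tokens(6.50, ideal_tok_s(140, 4.8))

print(f"2x H100: ${h100_pair:.2f} per 1M tokens (batch 1, idealized)")
print(f"1x H200: ${h200_single:.2f} per 1M tokens (batch 1, idealized)")
```

Even with perfect scaling granted to the H100 pair, the single H200 comes out slightly cheaper per token at these prices. Real tensor-parallel communication overhead widens that gap, while batching lowers both figures by roughly the same factor.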

NVIDIA Blackwell: B200 and GB200

Blackwell (announced March 2024, shipping 2025–2026) represents a step-function improvement:

B200 SXM

192 GB HBM3e: enough for a 70B FP16 model with substantial KV cache headroom, or a 405B FP8 model on 3× chips (405 GB of weights exceeds the 384 GB of 2× B200)
8.0 TB/s bandwidth — 2.4× H100's bandwidth, more than 1.6× H200
~2,250 TFLOPS FP16 / ~4,500 TFLOPS FP8 — significant compute increase
FP4 Tensor Cores — new in Blackwell; FP4 quantization with minimal quality loss for select model families
~$8–15/hr (estimated; pricing stabilizing through 2026)

For Llama 3.1 70B FP16: a single B200 can serve it with 52 GB remaining for KV cache (70B × 2 bytes = 140 GB, 192 GB − 140 GB = 52 GB). No tensor parallelism needed. This simplifies deployment substantially.
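To see what 52 GB of headroom buys, the sketch below converts it into a KV-cache token budget. The architecture values (80 layers, 8 KV heads under grouped-query attention, head dimension 128) are assumptions taken from the publicly released Llama 3.1 70B configuration rather than from this guide, and the function names are illustrative. Activations and allocator fragmentation are ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_token_capacity(free_gb: float, per_token_bytes: int) -> int:
    """Number of cached tokens that fit in the given free VRAM (decimal GB)."""
    return int(free_gb * 1e9 // per_token_bytes)


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
capacity = kv_token_capacity(free_gb=52, per_token_bytes=per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in 52 GB: {capacity:,}")
print(f"Concurrent 8K-token contexts: {capacity // 8192}")
```

With these assumptions each token costs 320 KiB of cache, so 52 GB holds roughly 158,000 tokens, or about 19 concurrent 8K-token contexts on one B200.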

GB200 NVL72

The GB200 NVL72 is a rack-scale system: 36 Grace CPUs + 72 B200 GPUs connected via NVLink Switch, with 1.8 TB/s of NVLink bandwidth per GPU for all-to-all traffic across the rack. At this scale, a single DeepSeek-R1 671B model with expert parallelism can route tokens across all 72 GPUs with low latency. This is the target platform for frontier model serving at scale.

📖 Why NVLink Switch changes the equation for MoE

Mixture-of-experts models route each token to a subset of expert modules. If experts are distributed across GPUs, token routing requires inter-GPU communication on every forward pass. NVLink Switch provides all-to-all connectivity between all 72 GPUs at 1.8 TB/s per GPU (roughly 130 TB/s aggregate), which keeps expert-parallel routing from dominating step time. PCIe-connected GPU clusters cannot viably scale to this level.

NVIDIA Ada Lovelace: L4 and L40S

Ada Lovelace is the consumer/professional-grade generation, using GDDR6 memory rather than HBM. This is a significant difference for inference:

L4 (24 GB GDDR6)

24 GB GDDR6 — fits 7B FP16, 13B FP8
300 GB/s bandwidth: about one-eleventh of H100, so decode is proportionally slower
121 TFLOPS FP16 / 242 TFLOPS FP8
72W TDP — fits in standard PCIe slots without special power requirements
~$0.40–0.80/hr — extremely cost-effective for small models

The L4 is not for large model serving. Its value is density (8 L4s fit in a standard 1U server) and cost efficiency for 7B–13B models at moderate concurrency. For a chatbot serving hundreds of users/hour on a 7B model, 4× L4 at $2/hr competes favorably with 1× H100 at $4/hr serving similar throughput.

L40S (48 GB GDDR6)

48 GB GDDR6 — fits 13B FP16, 30B FP8
864 GB/s bandwidth — 26% of H100; still bandwidth-constrained for decode
362 TFLOPS FP16 / 733 TFLOPS FP8
350W TDP
~$1.50–2.50/hr

L40S is the mid-tier sweet spot for organizations that need more VRAM than L4 but can't justify H100 prices. It handles 30B FP8 models reasonably well and has excellent video/image generation throughput (optimized for creative workloads).

NVIDIA Ampere: A100

The A100 is the previous flagship generation:

40 GB or 80 GB HBM2e — slower memory type than H100's HBM3
2.0 TB/s bandwidth (80GB version) — 40% less than H100
312 TFLOPS FP16 — no native FP8 Tensor Cores (Hopper introduced FP8)
400W TDP
~$2.00–3.50/hr

The A100 remains relevant because it's widely available and cheaper than H100. For workloads that are tolerant of higher latency or can use very large batches (where the compute ceiling matters more), A100 remains cost-competitive. A100 has no FP8 Tensor Cores, so FP8 weights must be upcast to FP16 before the matrix multiply: quantization still saves memory and bandwidth, but the FP8 compute speedup that H100 gets natively isn't available.

AMD MI300X

AMD's MI300X (launched December 2023) is the strongest NVIDIA alternative for LLM inference:

192 GB HBM3 — matches B200's VRAM, far exceeds H100's 80 GB
5.3 TB/s bandwidth: exceeds H200 (4.8 TB/s), though well short of B200 (8.0 TB/s)
1,307 TFLOPS FP16 — strong compute, behind H100 FP8 path but competitive at FP16
750W TDP
~$3–5/hr on ROCm-compatible cloud providers (AMD Cloud, Azure ND MI300X)

The MI300X's 192 GB HBM3 is its killer feature. A 70B FP16 model (140 GB) fits comfortably on a single MI300X with ~52 GB left for KV cache. On H100, you need 2× chips (or FP8 quantization to fit in 80 GB). For teams running large FP16 models that want to avoid quantization, MI300X can be dramatically cheaper than two H100s.

The catch: The ROCm software ecosystem lags CUDA. PyTorch ROCm works well for standard inference. vLLM has solid ROCm support. But TensorRT-LLM is NVIDIA-only, and many cutting-edge kernel optimizations (FlashAttention-3, Triton-optimized MoE routing) arrive on CUDA first, with ROCm ports weeks to months later. Teams that need the absolute latest optimizations will find NVIDIA's software ecosystem more complete.

Dimension	H100 SXM 80G	H200 SXM 141G	MI300X	B200 SXM
VRAM	80 GB HBM3	141 GB HBM3e	192 GB HBM3	192 GB HBM3e
Memory bandwidth	3.35 TB/s	4.8 TB/s	5.3 TB/s	8.0 TB/s
FP16 TFLOPS	989	989	1,307	~2,250
FP8 TFLOPS	1,979	1,979	~2,600*	~4,500
TDP	700W	700W	750W	1,000W
Interconnect	NVLink 4.0	NVLink 4.0	Infinity Fabric	NVLink 5.0
Approx $/hr	$3.50–6	$5–8	$3–5	$8–15
Software ecosystem	Best (CUDA)	Best (CUDA)	Good (ROCm)	Best (CUDA)
Best for	General inference	Large models, decode	70B+ FP16 serving	Frontier models

*MI300X FP8 figure is AMD's published dense peak (about 2,615 TFLOPS), rounded.
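The table's raw specs can be turned into the two ratios this guide recommends ranking by. The following sketch assumes the midpoint of each hourly price range as the price, which is a simplification, and the `Chip` class and `rank` helper are illustrative names. It says nothing about software maturity, which the table scores separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    vram_gb: float
    bandwidth_tbps: float
    price_low: float   # $/hr, low end of the range in the table
    price_high: float  # $/hr, high end of the range in the table

    @property
    def price_mid(self) -> float:
        return (self.price_low + self.price_high) / 2


CHIPS = [
    Chip("H100 SXM", 80, 3.35, 3.50, 6.00),
    Chip("H200 SXM", 141, 4.8, 5.00, 8.00),
    Chip("MI300X", 192, 5.3, 3.00, 5.00),
    Chip("B200 SXM", 192, 8.0, 8.00, 15.00),
]


def rank(chips, key):
    """Return chips sorted best-first by the given per-dollar metric."""
    return sorted(chips, key=key, reverse=True)


by_bandwidth = rank(CHIPS, lambda c: c.bandwidth_tbps / c.price_mid)
by_vram = rank(CHIPS, lambda c: c.vram_gb / c.price_mid)

for chip in by_bandwidth:
    print(f"{chip.name:9s} {chip.bandwidth_tbps / chip.price_mid:.2f} TB/s per $/hr, "
          f"{chip.vram_gb / chip.price_mid:.1f} GB per $/hr")
```

At midpoint prices MI300X leads on both ratios, with H200, H100, and B200 grouped closely behind on bandwidth per dollar. The ordering is sensitive to the price you actually negotiate, so rerun it with your own quotes.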

Google TPU v5e and v6

Google's TPUs are systolic-array accelerators designed for both training and inference on Google Cloud:

TPU v5e

16 GB HBM per chip — limited single-chip VRAM; typical inference uses pod slices (multi-chip)
197 TFLOPS BF16 / 393 TOPS INT8 per chip
Best for: JAX-based model serving at scale; competitive on Gemini and other Google-developed models
Cost-effective for batch inference when running on Google Cloud with Pathways/JAX

TPU v6 (Trillium)

Approximately 4.7× TPU v5e peak compute per chip
Pod-scale connectivity for very large model serving
Available on Google Cloud in limited regions (2025–2026)

TPU practical reality: TPUs are excellent for teams already invested in JAX (Flax, MaxText, etc.) and running on GCP. The JAX compilation model (XLA) produces highly efficient TPU code. For teams using PyTorch and HuggingFace — the large majority — TPUs require significant adaptation work. PyTorch/XLA exists but the optimization feedback loop is slower than CUDA. TPUs win on price/performance for specific workloads at Google Cloud scale; they're a worse choice for teams outside the JAX ecosystem.

AWS Inferentia and Trainium

AWS's custom silicon for ML inference (Inferentia) and training (Trainium):

AWS Inferentia2

Available as inf2 instances: inf2.xlarge (1 chip, 32 GB) through inf2.48xlarge (12 chips, 384 GB)
Each chip: 32 GB HBM, ~0.8 TB/s bandwidth
Achieves 40–60% lower cost-per-token than H100 for stable, high-volume Llama-family serving
Requires model compilation via AWS Neuron SDK (analogous to TensorRT compilation)
Supported models: Llama, Mistral, BERT, Whisper — growing list via Neuron SDK 2.x

AWS Trainium2

Designed for training but usable for inference; trn2.48xlarge instances
96 GB HBM per chip (1.5 TB across the 16 chips of a trn2.48xlarge), strong FP8 support

Inferentia's value proposition: For high-volume, stable serving of popular model architectures, the cost savings are real — 40–60% vs GPU instances. The tradeoff is the same as TensorRT-LLM: compilation overhead, limited architecture support, and reduced flexibility. Works best for teams running a single model variant at sustained high volume on AWS.

Intel Gaudi 3

Intel's Gaudi 3 (2024) targets the NVIDIA alternative market:

128 GB HBM2e per accelerator
3.7 TB/s bandwidth — strong bandwidth, between H100 and H200
~1,835 TFLOPS BF16 (Intel's published peak): good compute
Available on Intel Developer Cloud and select partnerships
Software: Intel Optimum Habana, supports PyTorch and HuggingFace natively
Most mature Intel alternative to NVIDIA for LLM inference

Gaudi 3's 128 GB HBM2e is its main selling point over H100's 80 GB, at potentially competitive pricing. The software ecosystem is less mature than CUDA but more complete than ROCm for common HuggingFace workloads.

Groq LPU: the latency-first outlier

Groq's Language Processing Unit (LPU) takes a completely different architectural approach: a deterministic, on-chip SRAM-based design where model weights are loaded into fast on-chip memory and execution is fully deterministic with no DRAM bottleneck during inference.

Ultra-low latency token generation — demonstrated >500 tok/s on Llama 70B, single-user
No HBM at all: model weights live in on-chip SRAM, sharded across many LPU chips
Limited model size — practical upper limit ~70B parameters with quantization
Cloud API only — not available as bare-metal hardware

Groq excels for latency-sensitive workloads where cost is secondary: real-time voice, near-instant code completion, low-latency agents. Not suitable for frontier-scale models (405B, 671B MoE) or high-throughput batch inference.

Hardware selection framework

Work through these decision filters:

1. Model size and VRAM requirements

Calculate: model_params × bytes_per_param × 1.25 (weights + 25% overhead for KV cache + activations at your target batch size).

Model (precision)	Min VRAM needed	Fits on
7B FP8	~9 GB	L4, L40S, any GPU
13B FP8	~16 GB	L4, L40S, H100, A100
70B FP8	~88 GB	H100×2, H200, MI300X, B200
70B FP16	~175 GB	H200×2, MI300X, B200
405B FP8	~506 GB	H100×8, B200×3, NVL72
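The sizing rule above is easy to script. This is a minimal sketch of that rule of thumb (parameters × bytes per parameter × 1.25), with per-chip VRAM taken from this guide; the function names are illustrative, and the 25% overhead is the article's planning figure, not a measurement for any specific batch size or context length.

```python
import math

CHIP_VRAM_GB = {"H100": 80, "H200": 141, "MI300X": 192, "B200": 192}


def vram_needed_gb(params_billion: float, bytes_per_param: float,
                   overhead: float = 1.25) -> float:
    """Weights plus a flat 25% allowance for KV cache and activations."""
    return params_billion * bytes_per_param * overhead


def min_chips(required_gb: float, chip_vram_gb: float) -> int:
    """Smallest chip count whose combined VRAM covers the requirement."""
    return math.ceil(required_gb / chip_vram_gb)


def chips_required(params_billion: float, bytes_per_param: float) -> dict:
    """Minimum chip count per chip type, by capacity alone."""
    need = vram_needed_gb(params_billion, bytes_per_param)
    return {name: min_chips(need, vram) for name, vram in CHIP_VRAM_GB.items()}


for label, params, width in [("70B FP8", 70, 1), ("70B FP16", 70, 2), ("405B FP8", 405, 1)]:
    need = vram_needed_gb(params, width)
    print(f"{label:9s} needs ~{need:.0f} GB -> {chips_required(params, width)}")
```

The counts are capacity minimums. Tensor-parallel deployments usually round up to a power of two, which is why a 405B FP8 model that needs seven H100s by capacity is normally run on eight.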

2. Workload type and primary optimization target

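Interactive chat (decode-heavy, latency-sensitive): Rank by bandwidth/$. On this guide's hourly price ranges, MI300X leads on raw bandwidth per dollar, H200, A100, and H100 sit close together behind it, and L40S trails; teams that need the CUDA stack typically land on H200 or H100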
Batch inference, throughput-maximizing: Rank by FLOPS/$ at your batch size — H100/H200 with FP8 often wins
Cost-minimizing, stable model, high volume: Consider AWS Inferentia2 (40–60% savings) or L4/L40S (for small models)
Ultra-low latency, ≤70B: Groq LPU if on-demand API is acceptable

3. Multi-GPU topology requirements

If TP≥2 is needed (model too large for single GPU), NVLink is critical:

H100/H200/B200 SXM: NVLink 4.0/5.0, 900 GB/s (Hopper) to 1.8 TB/s (Blackwell) per GPU
H100/H200 PCIe — 64 GB/s PCIe, TP latency penalty is severe; avoid for TP≥4
MI300X — Infinity Fabric at competitive bandwidth for AMD cluster configurations
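A back-of-the-envelope model shows why the interconnect matters for tensor parallelism. The sketch below assumes a Megatron-style layout with two all-reduces per transformer layer, a ring all-reduce in which each GPU sends 2(N−1)/N times the message size, and a bandwidth-only cost model with no per-message latency. The model shape (hidden size 8192, 80 layers) is an assumed 70B-class configuration, and the link speeds are per-direction figures: 450 GB/s as half of NVLink 4.0's 900 GB/s bidirectional, and 64 GB/s for PCIe. All names are illustrative.

```python
def allreduce_bytes_per_gpu(message_bytes: float, tp_degree: int) -> float:
    """Bytes each GPU sends in a ring all-reduce of one message."""
    return 2 * (tp_degree - 1) / tp_degree * message_bytes


def tp_comm_seconds(tokens: int, hidden_size: int, num_layers: int,
                    tp_degree: int, link_gb_per_s: float,
                    bytes_per_value: int = 2,
                    allreduces_per_layer: int = 2) -> float:
    """Bandwidth-only estimate of all-reduce time for one forward pass."""
    message = tokens * hidden_size * bytes_per_value  # activation tensor size
    per_layer = allreduces_per_layer * allreduce_bytes_per_gpu(message, tp_degree)
    return num_layers * per_layer / (link_gb_per_s * 1e9)


# Prefill of a 4,096-token prompt on an assumed 70B-class model with TP=2.
shape = dict(tokens=4096, hidden_size=8192, num_layers=80, tp_degree=2)
nvlink_s = tp_comm_seconds(**shape, link_gb_per_s=450)
pcie_s = tp_comm_seconds(**shape, link_gb_per_s=64)

print(f"NVLink: {nvlink_s * 1e3:.1f} ms of all-reduce traffic")
print(f"PCIe:   {pcie_s * 1e3:.1f} ms of all-reduce traffic")
```

Under these assumptions the prompt moves about 10.7 GB per GPU, which costs roughly 24 ms over NVLink and 168 ms over PCIe. The gap grows with TP degree, since the per-GPU traffic factor rises toward 2× the message size.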

4. Software ecosystem fit

CUDA-first team (PyTorch, TRT-LLM, FlashAttention): stick with NVIDIA
GCP + JAX team: TPUs are first-class citizens
AWS-native, stable model: Inferentia2 is worth the compilation overhead
AMD: ROCm + vLLM is mature; TRT-LLM is unavailable

5. TCO over 1 year

Spot pricing, reserved-instance discounts, and power costs change the picture significantly for long-running deployments:

GPU	1yr reserved $/hr (est.)	1yr total cost (per accelerator, 24/7)	Suitable for
H100 SXM	$2.50–3.50	$22k–31k	Most LLM serving
H200 SXM	$3.50–5.00	$31k–44k	Large models, long context
MI300X	$2.00–3.50	$18k–31k	70B FP16 serving
A100 SXM	$1.50–2.50	$13k–22k	Budget, lower throughput
Inferentia2	~$1.00–2.00	$9k–18k	Stable model, AWS
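The yearly totals in the table follow directly from the hourly rate times 8,760 hours. A small sketch of that arithmetic, with an illustrative function name and the assumption that a reserved accelerator is billed around the clock whether or not it is busy:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_cost_usd(rate_per_hour: float, num_accelerators: int = 1) -> float:
    """Yearly cost of accelerators billed continuously at a flat hourly rate."""
    return rate_per_hour * HOURS_PER_YEAR * num_accelerators


# Estimated 1-year reserved rates from the table above (low, high).
RESERVED_RATES = {
    "H100 SXM": (2.50, 3.50),
    "H200 SXM": (3.50, 5.00),
    "MI300X": (2.00, 3.50),
    "A100 SXM": (1.50, 2.50),
    "Inferentia2": (1.00, 2.00),
}

for name, (low, high) in RESERVED_RATES.items():
    print(f"{name:12s} ${annual_cost_usd(low):>8,.0f} to ${annual_cost_usd(high):>8,.0f} per year")
```

Pass `num_accelerators` to scale to a full node, for example `annual_cost_usd(2.50, num_accelerators=8)` for an eight-GPU H100 server at the low end of the range.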

Training vs inference hardware: why they diverge

A common mistake: assuming the best training chip is the best inference chip. The differences are structural:

Training requires:

Maximum sustained compute (long backward passes)
High-precision arithmetic (BF16/FP32 gradient accumulation)
Fast inter-chip interconnect for gradient all-reduce across many chips
Large batch sizes (thousands of sequences simultaneously)

Inference requires:

Memory bandwidth (decode is bandwidth-bound at small batches)
Low latency (single-user requests don't benefit from large batches)
Aggressive quantization (FP8, INT4 — unacceptable for training gradients)
Cost-efficiency at serving load (not training throughput)

This is why purpose-built inference chips exist and why the A100 (great for training in 2021) aged out faster for inference than for training: its 2.0 TB/s of bandwidth and lack of native FP8 cap it in the bandwidth-bound, quantization-friendly decode regime that defines LLM serving economics.
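The split between the two regimes can be expressed with the roofline ridge point: peak FLOPS divided by memory bandwidth, in FLOPs per byte. The sketch below is a simplified model with illustrative names. It assumes FP16 weights, for which decoding a batch of B sequences performs about B FLOPs per byte of weights read, so the ridge point approximates the batch size at which decode stops being bandwidth-bound. KV-cache traffic is ignored.

```python
# (FP16 TFLOPS, memory bandwidth in TB/s), as quoted in this guide.
SPECS = {
    "A100 80G": (312, 2.0),
    "H100 SXM": (989, 3.35),
    "MI300X": (1307, 5.3),
    "B200 SXM": (2250, 8.0),
}


def ridge_flops_per_byte(tflops: float, bandwidth_tbps: float) -> float:
    """Arithmetic intensity at which compute and bandwidth limits meet."""
    return tflops / bandwidth_tbps  # TFLOPS / (TB/s) = FLOPs per byte


def is_bandwidth_bound(batch_size: int, tflops: float, bandwidth_tbps: float) -> bool:
    """True if FP16 decode at this batch size sits below the ridge point."""
    return batch_size < ridge_flops_per_byte(tflops, bandwidth_tbps)


for chip, (tflops, bandwidth) in SPECS.items():
    ridge = ridge_flops_per_byte(tflops, bandwidth)
    print(f"{chip:9s} ridge ~ {ridge:5.0f} FLOPs/byte "
          f"(batch 8 bandwidth-bound: {is_bandwidth_bound(8, tflops, bandwidth)})")
```

Every chip here has a ridge point in the hundreds, so interactive serving at single-digit or low double-digit batch sizes is firmly bandwidth-bound, while training batches of thousands of sequences sit on the compute side of the roofline.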

Bottom line

NVIDIA H100/H200 is the right default for the vast majority of production LLM inference. H100 SXM for models up to 70B FP8; H200 SXM for 70B FP16 or to squeeze every decode token out of bandwidth. MI300X is the compelling alternative when 192 GB VRAM avoids multi-GPU tensor parallelism for large FP16 models and your team can manage ROCm. AWS Inferentia2 delivers real cost savings (40–60%) for stable, high-volume workloads on AWS. Groq LPU is the right call only when single-digit-millisecond token latency is a hard requirement for ≤70B models.

Key Takeaway

For LLM inference hardware selection, start with VRAM capacity (can the model fit?) then rank by memory bandwidth per dollar (which determines decode tokens/second per dollar). NVIDIA H100/H200 is the production default with the best software ecosystem; AMD MI300X is the best alternative for large FP16 models at lower VRAM cost; dedicated inference chips win only for specific cost-or-latency profiles with stable workloads. Avoid selecting hardware based solely on FLOP count — it is the least useful specification for decode-dominated inference workloads.

Frequently asked questions

What hardware is best for AI inference?

For most LLM inference, NVIDIA datacenter GPUs (H100, H200, B200) are the default due to mature software and high memory bandwidth. AMD MI300-series and Google TPUs are competitive alternatives. Dedicated inference chips (Groq, Cerebras, AWS Inferentia) can win on specific latency or cost profiles.

What's the difference between training and inference hardware?

Training needs maximum compute and high-precision arithmetic across many GPUs for long runs. Inference prioritizes memory bandwidth (for decode), low latency, and cost-efficiency, and often uses lower precision (FP8/INT8/INT4). Some chips are inference-only.

Are dedicated inference chips better than GPUs?

Sometimes. Dedicated inference silicon can deliver lower latency or better cost-per-token for specific models and workloads, but GPUs remain the most flexible and best-supported option. The right choice depends on your model, latency SLO, and scale.
