
GPU Inference Explained

8 min read · Updated 2026-06-01

💡This is the quick GPU inference guide

This article explains the key concepts and gives a practical GPU selection framework. For the full deep-dive on GPU architecture generations, VRAM, interconnects, and cloud accelerator options, read Chapter 3: GPU Hardware & Accelerators.

TL;DR

GPU inference is split into two fundamentally different phases: prefill (compute-bound, parallel, fast) and decode (memory-bandwidth-bound, sequential, the dominant cost). For most production chat workloads, memory bandwidth — not FLOP count — determines tokens-per-second, which is why an H200 outperforms an H100 despite similar compute specs. Choosing the right GPU requires understanding the roofline model, sizing VRAM for weights plus KV cache, and matching the hardware to whether your workload is latency-sensitive (low batch, bandwidth-critical) or throughput-oriented (large batch, compute-critical).

Key Facts

Decode phase bottleneck: Memory bandwidth (not FLOPS) — re-reads full weights each token
H100 SXM memory bandwidth: 3.35 TB/s HBM3 vs A100's 2.0 TB/s HBM2e
7B FP16 model VRAM floor: ~14 GB weights + KV cache headroom needed on top
Arithmetic intensity crossover: ~300 FLOP/byte separates compute-bound from bandwidth-bound
Batching effect on decode: Batch size 1→32 roughly 30× throughput gain before compute-saturation
L4 vs H100 price efficiency: L4 ~$0.40–0.80/hr vs H100 ~$3.50–6.00/hr; L4 wins for small models under light load

GPUs power almost all modern LLM inference, but understanding why — and more importantly, where the real bottleneck lies — is essential before you make hardware or optimization decisions. The mental model most engineers have (GPUs are fast because FLOPS) is incomplete. The accurate model is more nuanced, and getting it right changes every decision downstream.

Why GPUs for matrix math

A CPU is designed for complex sequential control flow: branch prediction, out-of-order execution, a deep cache hierarchy. It has a small number of powerful cores (8–128) optimized for serial latency. A GPU inverts all of this: thousands of simpler cores (16,896 FP32 CUDA cores on an H100 SXM, plus 528 Tensor Cores) designed for parallel throughput. Each Tensor Core performs a matrix-multiply-accumulate (MMA) on a small tile of operands per instruction, rather than one scalar operation at a time.

Transformer inference is almost entirely matrix multiplication:

Each attention projection (Q, K, V, O) is a [batch × seq_len × hidden_dim] × [hidden_dim × hidden_dim] matmul
Each feed-forward layer is two matmuls (up projection and down projection, or up/gate/down for SwiGLU)
For a 70B Llama-class model, >95% of FLOPs are in these linear layers

This is exactly the shape of problem Tensor Cores are designed for. H100 SXM Tensor Cores deliver 989 TFLOPS of dense FP16, roughly 30× what CPU vector units achieve on the same operation. Shuttling weights or activations between CPU and GPU on every inference call would add transfer overhead that dominates the runtime, so in production serving the weights and KV cache stay resident on the GPU.

The two phases: prefill and decode

Every LLM request executes in two phases with completely different performance characteristics. Understanding this split is the most important concept in GPU inference engineering.

Prefill: compute-bound

Prefill processes the entire input prompt in a single forward pass. If your prompt is 1,024 tokens, all 1,024 token representations flow through the network simultaneously as a matrix. The weight matrices are loaded once from VRAM, and the matmul runs against a large input matrix. This gives high arithmetic intensity — many FLOPs per byte of weight data loaded.

For a 70B FP16 model processing a 1,024-token prompt:

Total FLOPs: ~2 × 70B × 1,024 ≈ 143 TFLOP (single forward pass)
Weight data read: 140 GB (full model)
Arithmetic intensity: 143 TFLOP ÷ 140 GB ≈ 1,024 FLOP/byte

At this intensity, you're firmly in compute-bound territory — Tensor Cores are the bottleneck, not memory bandwidth. Prefill throughput scales with FLOPS. This phase determines time to first token (TTFT).

Decode: memory-bandwidth-bound

Decode generates output tokens one at a time. For each new token, the GPU must:

Read the full set of model weights from VRAM (140 GB for 70B FP16)
Run a single-vector × matrix multiply for each layer (batch of 1 token)
Sample the output distribution

The problem: a single token's activations form a [1 × hidden_dim] vector, not a matrix. The matmul of [1 × hidden_dim] × [hidden_dim × hidden_dim] has arithmetic intensity ≈ 1 FLOP/byte: each 2-byte FP16 weight is loaded once and used for a single multiply-add (2 FLOPs). The Tensor Cores are almost idle; the memory system is the bottleneck.

For a 70B FP16 model generating one token at batch size 1:

FLOPs: ~140 billion (one forward pass, single-token)
Weight data read: 140 GB
Arithmetic intensity: ~1 FLOP/byte

An H100 SXM's machine balance (peak FLOPS ÷ peak bandwidth) is 989 TFLOPS ÷ 3.35 TB/s ≈ 295 FLOP/byte. Your workload at 1 FLOP/byte sits 295× below that balance point, so the Tensor Cores are about 99.7% idle. You're limited entirely by how fast VRAM can deliver the weights.
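The two intensity figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming about 2 FLOPs per parameter per token and that the weights are read from VRAM exactly once per forward pass; the function name is illustrative, and KV cache and activation traffic are ignored.

```python
def arithmetic_intensity(params: float, tokens: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs (one multiply-add) per parameter per token and a
    single read of every weight per pass. KV cache reads are ignored.
    """
    flops = 2.0 * params * tokens
    bytes_read = params * bytes_per_param
    return flops / bytes_read


PARAMS_70B = 70e9

prefill_ai = arithmetic_intensity(PARAMS_70B, tokens=1024)  # 1,024-token prompt
decode_ai = arithmetic_intensity(PARAMS_70B, tokens=1)      # one token, batch size 1

print(f"prefill: {prefill_ai:.0f} FLOP/byte")
print(f"decode:  {decode_ai:.0f} FLOP/byte")
```

Under these assumptions the parameter count cancels out: intensity is simply tokens per weight read multiplied by 2 ÷ bytes per parameter, which is why it depends on batch and sequence length rather than model size.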

⚠️The counterintuitive bottleneck

For single-request or low-batch decode, the GPU's Tensor Cores are almost completely idle. You paid for 989 TFLOPS and you're using 0.3% of it. The H200's advantage over the H100 isn't compute (both are nearly idle) but bandwidth: 4.8 TB/s vs 3.35 TB/s, a 43% increase that translates into up to 43% more tokens per second for bandwidth-bound decode.

The roofline model: finding your bound

The roofline model gives a precise answer to "am I compute-bound or bandwidth-bound?" for any workload:

Attainable throughput = min(Peak FLOPS, Peak Bandwidth × Arithmetic Intensity)

Your workload's arithmetic intensity (AI) is FLOPs / bytes of memory traffic. If AI > machine balance (peak FLOPS / peak bandwidth), you're compute-bound. If AI < machine balance, you're bandwidth-bound.

H100 SXM machine balance: 989 TFLOPS ÷ 3.35 TB/s ≈ 295 FLOP/byte

Workload	Arithmetic Intensity	Bound on H100
Decode, batch=1	~1 FLOP/byte	Bandwidth
Decode, batch=8	~8 FLOP/byte	Bandwidth
Decode, batch=32	~32 FLOP/byte	Bandwidth
Decode, batch=256	~256 FLOP/byte	Bandwidth (barely)
Decode, batch=512	~512 FLOP/byte	Compute
Prefill, seq=512	~512 FLOP/byte	Compute
Prefill, seq=2048	~2048 FLOP/byte	Compute

This is why batching is so powerful: going from batch=1 to batch=32 pushes 32 requests' worth of tokens through the same weight read, multiplying effective throughput by up to 32× while you remain well below the compute ceiling. The gain is slightly smaller in practice because each request still reads its own KV cache. The GPU becomes more efficient as batch size grows, until you hit the compute roof.
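The roofline classification in the table can be sketched directly from the formula. The peak figures are the H100 SXM numbers quoted in this article; the function names are illustrative, and treating decode intensity as equal to batch size is the same simplification the table makes.

```python
H100_PEAK_FLOPS = 989e12  # dense FP16 FLOP/s, figure quoted in this article
H100_PEAK_BW = 3.35e12    # bytes/s, figure quoted in this article


def machine_balance(peak_flops: float, peak_bw: float) -> float:
    """FLOP/byte at which the compute roof and the bandwidth roof meet."""
    return peak_flops / peak_bw


def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: min(peak FLOPS, peak bandwidth x arithmetic intensity)."""
    return min(peak_flops, peak_bw * ai)


def bound(ai: float, peak_flops: float, peak_bw: float) -> str:
    """Classify a workload by comparing its intensity to the machine balance."""
    return "compute" if ai > machine_balance(peak_flops, peak_bw) else "bandwidth"


workloads = {
    "decode, batch=1": 1,
    "decode, batch=32": 32,
    "decode, batch=256": 256,
    "decode, batch=512": 512,
    "prefill, seq=2048": 2048,
}

for name, ai in workloads.items():
    tflops = attainable_flops(ai, H100_PEAK_FLOPS, H100_PEAK_BW) / 1e12
    kind = bound(ai, H100_PEAK_FLOPS, H100_PEAK_BW)
    print(f"{name:18s} {tflops:7.1f} TFLOPS attainable, {kind}-bound")
```

Swapping in another GPU's peak FLOPS and bandwidth moves the balance point, which is the quickest way to see whether a given batch size changes regime on different hardware.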

Try it: Arithmetic Intensity Calculator →

Plot your exact workload on a roofline chart to see whether you're compute-bound or bandwidth-bound and how much headroom remains.

VRAM: the hard ceiling

Bandwidth and FLOPS determine speed. VRAM capacity determines whether you can run the model at all. The GPU must hold:

Model weights

The dominant cost. For a model with P parameters:

FP32: P × 4 bytes (rarely used in inference)
BF16 / FP16: P × 2 bytes (standard baseline)
FP8: P × 1 byte (standard production target in 2026)
INT4 (AWQ/GPTQ): P × 0.5 bytes

Common reference points:

Model	Parameters	FP16 VRAM (weights)	FP8 VRAM (weights)
Llama 3.2 3B	3B	6 GB	3 GB
Llama 3.1 8B	8B	16 GB	8 GB
Llama 3.1 70B	70B	140 GB	70 GB
Llama 3.1 405B	405B	810 GB	405 GB
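DeepSeek-R1 671B	671B (MoE, ~37B active)	~1,342 GB total*	~671 GB total*

*MoE models activate only a subset of experts per token (about 37B of DeepSeek-R1's 671B parameters), which reduces the compute and weight traffic per token. It does not reduce the capacity requirement: any token can be routed to any expert, so all expert weights must stay resident in accelerator memory, typically sharded across several GPUs.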
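The weight-memory rule (parameters × bytes per parameter) is simple enough to encode as a lookup. A minimal sketch follows; the dictionary and function names are illustrative, and the INT4 figure ignores the small extra cost of quantization scales and zero points that AWQ/GPTQ checkpoints carry.

```python
# Bytes per parameter for each precision listed above.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billion * BYTES_PER_PARAM[dtype]


for name, params_b in [("Llama 3.2 3B", 3), ("Llama 3.1 8B", 8),
                       ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    sizes = ", ".join(f"{dt}={weight_gb(params_b, dt):g} GB" for dt in ("fp16", "fp8", "int4"))
    print(f"{name:15s} {sizes}")
```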

KV cache

The KV cache stores attention keys and values for every token in every layer of the network. Its size is:

KV cache (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

For Llama 3.1 70B (80 layers, 8 KV heads, 128 head_dim) at FP16, serving 100 concurrent requests with 4,096 max context:

2 × 80 × 8 × 128 × 4096 × 100 × 2 bytes ≈ 134 GB (about 1.34 GB per request)

This rivals the model weights (140 GB FP16) in VRAM consumption. At long context (32K tokens), each request needs about 10.7 GB, so as few as 14 concurrent requests push the KV cache past the model weights. This is why FP8 KV cache quantization and KV cache compression are high-value optimizations for long-context serving.
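The KV cache formula translates directly into code. This sketch evaluates it for the Llama 3.1 70B configuration given above (80 layers, 8 KV heads, head_dim 128, FP16); the function name is illustrative, and it assumes every request is allocated its full maximum context, which paged KV cache implementations avoid.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_element: int = 2) -> int:
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(**LLAMA_70B, seq_len=1, batch_size=1)       # 327,680 bytes
per_request = kv_cache_bytes(**LLAMA_70B, seq_len=4096, batch_size=1)
total = kv_cache_bytes(**LLAMA_70B, seq_len=4096, batch_size=100)

print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"per request:  {per_request / 1e9:.2f} GB")
print(f"100 requests: {total / 1e9:.0f} GB")
```

Passing bytes_per_element=1 models an FP8 KV cache and halves every figure.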

Activations and framework overhead

Runtime activations during the forward pass plus PyTorch/CUDA framework overhead typically add 5–20% on top. Budget conservatively: a good rule of thumb is weights + KV cache + 20% overhead.
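The rule of thumb (weights + KV cache + 20% overhead) can be turned into a quick fit check. This is a rough sketch: the function names and the 20 GB KV cache budget in the example are assumptions for illustration, and real engines often impose extra constraints on the GPU count, such as requiring it to divide the number of attention heads.

```python
import math


def required_vram_gb(weights_gb: float, kv_cache_gb: float, overhead: float = 0.20) -> float:
    """Weights plus KV cache, padded by a fractional overhead buffer."""
    return (weights_gb + kv_cache_gb) * (1.0 + overhead)


def min_gpus(weights_gb: float, kv_cache_gb: float, gpu_vram_gb: float,
             overhead: float = 0.20) -> int:
    """Smallest GPU count whose combined VRAM covers the padded budget."""
    return math.ceil(required_vram_gb(weights_gb, kv_cache_gb, overhead) / gpu_vram_gb)


# Example: 70B at FP8 (70 GB of weights) with a hypothetical 20 GB KV cache budget.
need = required_vram_gb(70, 20)
print(f"required VRAM:     {need:.0f} GB")
print(f"80 GB GPUs needed:  {min_gpus(70, 20, 80)}")
print(f"141 GB GPUs needed: {min_gpus(70, 20, 141)}")
```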

Try it: VRAM Calculator →

Calculate exact memory requirements for any model, quantization level, sequence length, and batch size.

Try it: KV Cache Sizing Calculator →

Understand how much VRAM your KV cache consumes at different concurrency and context length settings.

Tensor Cores: the hardware path for matrix math

Modern NVIDIA GPUs have two types of compute resources:

CUDA cores: general scalar/vector arithmetic. Each runs one multiply-add per clock.
Tensor Cores: matrix multiply-accumulate (MMA) units. A warp-level WMMA (Warp Matrix Multiply-Accumulate) instruction operates on a 16×16×16 tile, equivalent to 4,096 multiply-adds per instruction, which the Tensor Cores execute over a small number of cycles.

On an H100 SXM, dense FP16 Tensor Core throughput (989 TFLOPS) is roughly 15× the FP32 throughput of the CUDA cores (about 67 TFLOPS). In practice, inference frameworks use GPU kernel libraries (cuBLAS, cuDNN, FlashAttention) that automatically route matmuls to Tensor Cores. You don't write Tensor Core code directly: you write a standard PyTorch matmul and the library dispatches to Tensor Cores.

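The H100 adds FP8 Tensor Core support (E4M3 and E5M2 formats) at 2× the FP16 throughput. FP8 quantization therefore helps both phases on H100: weight bytes halve, which raises the bandwidth-bound decode ceiling by up to 2×, and peak Tensor Core throughput doubles, which raises the compute-bound prefill ceiling by up to 2×. Arithmetic intensity per forward pass doubles (same FLOPs, half the bytes); the two gains apply to different regimes and do not compound into 4×. The cost is a possible accuracy loss, which should be validated per model.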

GPU specs comparison for inference

The six GPUs you'll most commonly encounter for inference workloads:

GPU	VRAM	Memory type	Memory BW	FP16 TFLOPS (dense)	FP8 TFLOPS (dense)	TDP	Approx $/hr (cloud)
H100 SXM 80G	80 GB	HBM3	3.35 TB/s	989	1,979	700W	$3.50–6.00
H200 SXM 141G	141 GB	HBM3e	4.8 TB/s	989	1,979	700W	$5.00–8.00
A100 SXM 80G	80 GB	HBM2e	2.0 TB/s	312	N/A	400W	$2.00–3.50
L4	24 GB	GDDR6	0.30 TB/s	121	242	72W	$0.40–0.80
L40S	48 GB	GDDR6	0.86 TB/s	362	733	350W	$1.50–2.50
B200 SXM	192 GB	HBM3e	8.0 TB/s	~2,250	~4,500	1,000W	$8–15 (est.)

Key observations:

H200 vs H100: Same compute, +43% bandwidth, +76% VRAM — a pure inference win, especially for large models and long context
A100 vs H100: A100's bandwidth (2.0 TB/s) is only 60% of H100's — you're leaving 40% of decode tokens/sec on the table
L4: Low bandwidth (300 GB/s) but excellent power efficiency (72W). Best for 7B–13B models at low concurrency, where cost per hour matters more than the throughput ceiling. A 13B model needs FP8 or INT4 to fit in 24 GB (it is 26 GB at FP16)
B200: Roughly 2.4× H100 bandwidth and VRAM — designed for 405B+ models and next-generation frontier serving
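Because low-batch decode is bandwidth-bound, the bandwidth column alone gives a first-order ceiling on single-request decode speed. The sketch below uses the bandwidth figures from the table; the 8 GB model (an 8B-parameter model at FP8) and all names are assumptions chosen for illustration, and the result ignores KV cache reads and kernel overhead.

```python
# Memory bandwidth in TB/s, taken from the comparison table above.
GPU_BANDWIDTH_TB_S = {
    "H100 SXM": 3.35,
    "H200 SXM": 4.8,
    "A100 SXM": 2.0,
    "L4": 0.30,
    "L40S": 0.86,
    "B200 SXM": 8.0,
}

MODEL_WEIGHT_GB = 8.0  # assumption: an 8B-parameter model stored in FP8


def decode_ceiling_tok_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on batch-1 decode speed: one full weight read per token."""
    return (bandwidth_tb_s * 1e12) / (weight_gb * 1e9)


ranked = sorted(GPU_BANDWIDTH_TB_S.items(), key=lambda item: item[1], reverse=True)
for gpu, bw in ranked:
    print(f"{gpu:9s} {decode_ceiling_tok_s(bw, MODEL_WEIGHT_GB):7.1f} tok/s ceiling")
```

Dividing each ceiling by an hourly price gives a rough bandwidth-per-dollar comparison, though cloud prices vary too widely for that to be more than a starting point.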

Batching effects on effective throughput

Understanding how batch size interacts with the roofline is critical for capacity planning:

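Batch size	Regime	Roofline ceiling (H100, 70B FP16)	Notes
1	Deep bandwidth-bound	~24 tok/s	3.35 TB/s ÷ 140 GB ≈ 24 weight passes/s × 1 token per pass
8	Bandwidth-bound	~190 tok/s	Linear scaling with batch
32	Bandwidth-bound	~770 tok/s	Still scaling linearly
128	Bandwidth-bound	~3,100 tok/s	Approaching the balance point
256	Near balance point	~6,100 tok/s	Compute roof is 989 TFLOPS ÷ 140 GFLOP per token ≈ 7,100 tok/s

Numbers are idealized roofline ceilings for total output tokens/sec across all requests in the batch. They ignore KV cache reads, attention cost, and kernel overhead, so measured throughput is lower and compute limits appear earlier than the ideal model predicts. A 140 GB FP16 model also does not fit in one H100's 80 GB, so these figures illustrate the scaling shape rather than a deployable single-GPU configuration; FP8 weights roughly double the bandwidth-bound rows.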

This progression explains why continuous (in-flight) batching is the single highest-leverage inference optimization: it keeps the GPU in the high-efficiency portion of the roofline by ensuring the batch size stays large even when individual requests have different lengths.
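How total decode throughput scales with batch size can be estimated from the roofline alone. This is an idealized sketch, assuming 2 FLOPs per parameter per token and one weight read per decode step shared by the whole batch; it ignores KV cache traffic, attention cost, and scheduling overhead, so it gives ceilings rather than predictions. The function name is illustrative.

```python
def decode_tok_s(batch: int, params: float, bytes_per_param: float,
                 peak_flops: float, peak_bw: float) -> float:
    """Roofline ceiling on total output tokens/sec across the batch."""
    # Bandwidth roof: each step reads all weights once and yields `batch` tokens.
    bandwidth_limit = batch * peak_bw / (params * bytes_per_param)
    # Compute roof: each token costs about 2 FLOPs per parameter.
    compute_limit = peak_flops / (2.0 * params)
    return min(bandwidth_limit, compute_limit)


# 70B parameters at FP16 on the H100 SXM figures quoted in this article.
CONFIG = dict(params=70e9, bytes_per_param=2.0, peak_flops=989e12, peak_bw=3.35e12)

for batch in (1, 8, 32, 128, 256, 512):
    print(f"batch={batch:4d}  ceiling ~ {decode_tok_s(batch, **CONFIG):8.0f} tok/s")
```

The ceiling grows linearly with batch size until the batch approaches the machine balance, then flattens at the compute roof.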

Picking a GPU: decision heuristic

Work through these questions in order:

Step 1: Can the model fit? Calculate (weight bytes + KV cache bytes) × 1.2 (20% overhead buffer). If that exceeds a single GPU's VRAM, you need tensor parallelism (multiple GPUs). Check whether your preferred engine handles TP for your model.

Step 2: What's your primary workload type?

Chat / interactive generation (decode-heavy, latency-sensitive): maximize memory bandwidth per dollar. H200 > H100 > A100 >> L40S for decode speed.
Long prompt processing / embeddings (prefill-heavy): maximize FLOPS per dollar. H100 and H200 are equivalent here (same 989 TFLOPS), while the A100 trails at 312 TFLOPS.
Small model, cost-sensitive: L4 or L40S cost several times less per hour than an H100 and can be the more cost-effective choice for models ≤13B when load is too light to keep a larger GPU busy.

Step 3: What's your target SLO? If you need P99 TTFT < 200ms, you need enough prefill throughput that even your longest prompts finish fast. If you need P50 output token latency < 50ms, you need enough bandwidth that single-user decode runs fast.

Step 4: Multi-GPU or scale-out? For models that need multi-GPU (70B+ FP16, 405B+ FP8), NVLink bandwidth matters. H100/H200 SXM with NVLink 4.0 provides 900 GB/s of total bidirectional bandwidth per GPU; PCIe Gen5 x16 GPUs are limited to ~64 GB/s per direction and will bottleneck tensor parallelism severely.
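The SLO targets in Step 3 can be sanity-checked against hardware lower bounds before any benchmarking. This sketch assumes perfect utilization: TTFT is bounded below by prompt FLOPs ÷ peak FLOPS, and per-token decode latency by weight bytes ÷ peak bandwidth. The function names are illustrative, and real latencies will be higher once queueing, KV cache reads, and kernel overhead are included.

```python
def ttft_lower_bound_s(params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Best-case prefill time: about 2 FLOPs per parameter per prompt token."""
    return 2.0 * params * prompt_tokens / peak_flops


def token_latency_lower_bound_s(params: float, bytes_per_param: float, peak_bw: float) -> float:
    """Best-case batch-1 decode step: one full read of the weights."""
    return params * bytes_per_param / peak_bw


# 70B at FP16 against the H100 SXM figures quoted in this article
# (an idealized single-GPU-equivalent; 140 GB does not fit in one 80 GB GPU).
ttft = ttft_lower_bound_s(params=70e9, prompt_tokens=1024, peak_flops=989e12)
tpot = token_latency_lower_bound_s(params=70e9, bytes_per_param=2.0, peak_bw=3.35e12)

print(f"TTFT lower bound:      {ttft * 1e3:.0f} ms (target < 200 ms)")
print(f"per-token lower bound: {tpot * 1e3:.1f} ms (target < 50 ms)")
```

If a lower bound already exceeds the target, no amount of software tuning will meet the SLO on that hardware and precision; quantization or more aggregate bandwidth is required.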

📖NVLink vs PCIe for tensor parallelism

Tensor parallelism requires an all-reduce synchronization after every attention and feed-forward layer — for a 32-layer model that's 64 all-reduce operations per forward pass. On PCIe (64 GB/s), these synchronization costs dominate at TP≥4. On NVLink (900 GB/s), they're negligible. Always use SXM form factor (NVLink) for TP≥2 if latency matters.
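The all-reduce cost can be approximated from the message size and link bandwidth. This is a rough, bandwidth-only sketch: it assumes a ring all-reduce (each GPU sends about 2 × (N − 1) ÷ N of the message), two all-reduces per layer as described above, and the link figures quoted in this article. The model shape (32 layers, hidden_dim 4096) and all function names are assumptions for illustration. Per-operation latency is ignored, even though it tends to dominate for the small messages produced during decode.

```python
def allreduce_count(num_layers: int) -> int:
    """Two all-reduces per layer: one after attention, one after the feed-forward block."""
    return 2 * num_layers


def ring_allreduce_bytes_per_gpu(message_bytes: float, tp: int) -> float:
    """Bytes each GPU sends in a ring all-reduce of `message_bytes`."""
    return 2.0 * (tp - 1) / tp * message_bytes


def comm_time_s(num_layers: int, tokens: int, hidden_dim: int, tp: int,
                link_bytes_per_s: float, bytes_per_element: int = 2) -> float:
    """Bandwidth-only estimate of all-reduce time for one forward pass."""
    message = tokens * hidden_dim * bytes_per_element  # activations being reduced
    per_gpu = ring_allreduce_bytes_per_gpu(message, tp)
    return allreduce_count(num_layers) * per_gpu / link_bytes_per_s


# Hypothetical 32-layer model, hidden_dim 4096, 1,024-token prefill, TP=4.
SHAPE = dict(num_layers=32, tokens=1024, hidden_dim=4096, tp=4)

pcie = comm_time_s(**SHAPE, link_bytes_per_s=64e9)     # PCIe figure from this article
nvlink = comm_time_s(**SHAPE, link_bytes_per_s=900e9)  # NVLink figure from this article

print(f"PCIe:   {pcie * 1e3:.2f} ms per forward pass")
print(f"NVLink: {nvlink * 1e3:.2f} ms per forward pass")
```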

Bottom line

GPU inference performance splits cleanly into two regimes. Prefill is compute-bound: throw FLOPS at it with large batches and fast Tensor Cores. Decode is memory-bandwidth-bound: the dominant cost is moving weights from VRAM, and everything else is secondary.

For the majority of production LLM workloads — interactive chat, API serving, agent loops — you are in the decode-dominated regime. This means:

Quantize aggressively (FP8 or INT4) to reduce bytes per weight and thus memory traffic per token
Size VRAM for weights + KV cache with 20% headroom, not just for weights alone
Prefer higher-bandwidth GPUs (H200 over H100, H100 over A100) when decode speed matters
Use L4/L40S for cost-efficiency on small models at moderate concurrency

Key Takeaway

Decode dominates production LLM serving and is memory-bandwidth-bound, not compute-bound. The practical implication: an H200 generates up to 43% more tokens/second than an H100 on bandwidth-bound decode purely because of bandwidth, not more Tensor Cores. Size your VRAM for weights plus KV cache, not just weights. Quantize first: halving bytes per weight halves the memory traffic per token, which roughly doubles the bandwidth-bound decode ceiling.

Frequently asked questions

Why is GPU memory bandwidth more important than compute for inference?

During decode, an LLM generates one token at a time and must read the entire model's weights from VRAM for every single token. This makes decode memory-bandwidth-bound: the GPU spends most of its time waiting on memory, not computing. Prefill (processing the prompt) is compute-bound, but decode dominates most interactive workloads.

How much GPU memory do I need for inference?

You need VRAM for model weights (parameters × bytes per parameter) plus the KV cache (which grows with sequence length and batch size) plus 5–20% overhead for activations and the framework. Use our VRAM Calculator to get an exact figure for your model.

Can I run LLM inference without a GPU?

Yes, but it's much slower. CPUs lack the parallel throughput and memory bandwidth GPUs provide. CPU inference is viable for small models or low-throughput use cases; for production LLM serving, GPUs (or other accelerators like TPUs) are standard.
