GPU Inference Explained

8 min read · Updated 2026-06-01

GPUs power almost all modern LLM inference, but why they're the right tool — and where their real bottleneck lies — is widely misunderstood. This guide explains how GPUs run inference and what that means for choosing and sizing hardware.

Why GPUs, not CPUs

GPUs are throughput machines. Where a CPU excels at complex sequential logic, a GPU is built for simple operations run massively in parallel. Model inference is fundamentally a long sequence of matrix multiplications — exactly the workload GPUs are designed for.

A modern datacenter GPU has thousands of cores and, critically, specialized Tensor Cores that perform matrix-multiply-and-accumulate operations far faster than general-purpose cores. This is why a single H100 can serve models that would crawl on even a high-end CPU.

The two phases of LLM inference

Every LLM request runs in two distinct phases, and they stress the GPU very differently:

Prefill processes the entire input prompt in one parallel forward pass. It's compute-bound — the GPU's Tensor Cores are the limiting factor. This phase determines your time to first token (TTFT).

Decode generates output tokens one at a time, autoregressively. For each new token, the GPU must read the entire set of model weights from memory, plus the KV cache built up so far. At small batch sizes this makes decode memory-bandwidth-bound: the GPU spends most of its time waiting for data to arrive from VRAM, not computing. This phase determines your tokens per second (TPS).
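As a rough back-of-the-envelope sketch, both phases can be bounded from a GPU's two headline specs. This assumes a dense transformer that costs about 2 FLOPs per parameter per token, batch size 1, perfect hardware utilization, and no KV cache traffic, so the results are optimistic bounds rather than predictions. The function name `estimate_phases` and the GPU numbers are illustrative assumptions, not vendor figures.

```python
def estimate_phases(params, bytes_per_param, prompt_tokens,
                    peak_flops, mem_bandwidth):
    """Optimistic bounds for a dense model at batch size 1.

    Assumptions: ~2 FLOPs per parameter per token, perfect utilization,
    KV cache reads ignored.
    """
    # Prefill: compute-bound, all prompt tokens processed in one pass.
    ttft_s = 2 * params * prompt_tokens / peak_flops
    # Decode: memory-bound, every weight is read once per generated token.
    weight_bytes = params * bytes_per_param
    tokens_per_s = mem_bandwidth / weight_bytes
    return ttft_s, tokens_per_s


# Hypothetical GPU: 1e15 FLOP/s of FP16 compute, 3e12 bytes/s of bandwidth.
ttft, tps = estimate_phases(
    params=7e9, bytes_per_param=2, prompt_tokens=1000,
    peak_flops=1e15, mem_bandwidth=3e12,
)
print(f"TTFT lower bound: {ttft * 1000:.1f} ms")      # 14.0 ms
print(f"Decode upper bound: {tps:.0f} tokens/s")      # 214 tokens/s
```

Note that the prefill estimate depends only on FLOPS and the decode estimate only on bandwidth, which is the asymmetry the two phases describe.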

⚠️ The counterintuitive bottleneck

For most interactive LLM workloads, decode dominates — and decode is limited by memory bandwidth, not compute. This is why an H200 (higher bandwidth) often beats an H100 for token generation even though they have similar FLOPS.

Compute vs memory bandwidth

Two GPU specs matter most for inference:

  • Compute (FLOPS) — how many floating-point operations per second the Tensor Cores can do. This caps prefill throughput.
  • Memory bandwidth (TB/s) — how fast data moves between VRAM and the cores. This caps decode throughput.

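The ratio of these two (peak FLOPS divided by memory bandwidth) is the GPU's ridge point on a roofline plot. Compare it with the arithmetic intensity of your workload, meaning the number of floating-point operations performed per byte moved from memory, to see which limit you hit. Small-batch decode has low arithmetic intensity (you read a lot of weights to compute relatively little), so it's memory-bound. Large-batch prefill has high arithmetic intensity, so it's compute-bound.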
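One way to make this concrete is to compute a workload's arithmetic intensity (FLOPs per byte moved) and compare it with the GPU's peak FLOPS divided by its memory bandwidth. The sketch below does this for decode, counting weight traffic only and assuming about 2 FLOPs per weight per sequence in the batch. The helper names and GPU numbers are hypothetical.

```python
def decode_intensity(batch_size, bytes_per_param):
    """Approximate FLOPs per byte for one decode step (weights only).

    Each weight is read once per step and used for ~2 FLOPs (multiply
    and add) per sequence in the batch. KV cache traffic is ignored.
    """
    return 2 * batch_size / bytes_per_param


def bound_by(intensity, peak_flops, mem_bandwidth):
    """Compare workload intensity with the GPU's ridge point (FLOPs/byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory"


# Hypothetical GPU: 1e15 FLOP/s and 3e12 bytes/s, ridge ~333 FLOPs/byte.
for batch in (1, 64, 512):
    ai = decode_intensity(batch, bytes_per_param=2)
    print(batch, ai, bound_by(ai, 1e15, 3e12))
# 1 1.0 memory
# 64 64.0 memory
# 512 512.0 compute
```

Under these assumptions, batching is what moves decode toward the compute roof: the same weight read is amortized over more sequences.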


Memory: the hard limit

Beyond speed, VRAM capacity sets a hard ceiling on what you can run. The GPU must hold:

  • Model weights — parameters × bytes per parameter (e.g. a 7B model in FP16 ≈ 14 GB)
  • KV cache — grows with sequence length and batch size; can rival or exceed the weights for long contexts
  • Activations and overhead — typically 10-20% on top

If the weights don't fit, the model fails to load with an out-of-memory (OOM) error. If there isn't enough headroom for the KV cache, inference slows or crashes under load.
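The three components above can be added up in a few lines. This is a minimal sketch, assuming standard attention where the KV cache stores one key and one value vector per layer, per KV head, per token. The function name, the 15% overhead default, and the 7B-class model shape in the example are illustrative assumptions; check your model's actual config.

```python
def estimate_vram_bytes(params, bytes_per_param, n_layers, n_kv_heads,
                        head_dim, seq_len, batch_size,
                        kv_bytes=2, overhead=0.15):
    """Return (weights, kv_cache, total) in bytes."""
    weights = params * bytes_per_param
    # K and V tensors for every layer, token, and sequence in the batch.
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * seq_len * batch_size * kv_bytes)
    # Activations and runtime overhead as a flat fraction on top.
    total = (weights + kv_cache) * (1 + overhead)
    return weights, kv_cache, total


# Illustrative 7B-class shape in FP16, 4096-token context, batch of 8.
weights, kv, total = estimate_vram_bytes(
    params=7e9, bytes_per_param=2, n_layers=32, n_kv_heads=32,
    head_dim=128, seq_len=4096, batch_size=8,
)
print(f"weights  {weights / 1e9:.1f} GB")   # 14.0 GB
print(f"KV cache {kv / 1e9:.1f} GB")        # 17.2 GB
print(f"total    {total / 1e9:.1f} GB")     # 35.9 GB
```

In this example the KV cache is already larger than the weights, which is why long contexts and large batches need to be budgeted separately from model size.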


Picking a GPU for inference

A practical heuristic:

  1. Will the model weights fit in VRAM, with ~50% headroom for KV cache? If not, you need a bigger GPU or multiple GPUs with parallelism.
  2. Is your workload decode-heavy (chat, generation)? Prioritize memory bandwidth (H200 > H100).
  3. Is it prefill-heavy (long prompts, embeddings, image generation)? Prioritize compute (FLOPS).
  4. Cost-sensitive and small model? An L4 or L40S may be far more economical than an H100.
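The checklist above can be sketched as a small function. This is only an illustration of the decision order: the function name `recommend`, the returned strings, and the `small_model_gb` threshold are assumptions introduced here, and the 1.5 factor encodes the ~50% headroom from step 1.

```python
def recommend(weights_gb, vram_gb, workload, cost_sensitive=False,
              small_model_gb=16):
    """Walk the four-step heuristic and return a list of suggestions.

    workload is "decode" (chat, generation) or "prefill" (long prompts,
    embeddings). small_model_gb is an arbitrary cutoff for "small".
    """
    if workload not in ("decode", "prefill"):
        raise ValueError("workload must be 'decode' or 'prefill'")
    advice = []
    # Step 1: weights plus ~50% headroom for the KV cache must fit.
    if weights_gb * 1.5 > vram_gb:
        advice.append("use a bigger GPU or multi-GPU parallelism")
    # Steps 2 and 3: match the spec to the dominant phase.
    if workload == "decode":
        advice.append("prioritize memory bandwidth")
    else:
        advice.append("prioritize compute (FLOPS)")
    # Step 4: small models on a budget may not need a flagship GPU.
    if cost_sensitive and weights_gb <= small_model_gb:
        advice.append("consider a cheaper GPU such as an L4 or L40S")
    return advice


print(recommend(14, 80, "decode"))
print(recommend(140, 80, "prefill"))
print(recommend(14, 24, "decode", cost_sensitive=True))
```

The first call returns only the bandwidth advice, the second flags that 140 GB of weights will not fit in 80 GB, and the third adds the budget suggestion.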

Key Takeaway

GPUs win at inference because of parallel Tensor Core throughput, but the binding constraint for token generation is usually memory bandwidth, not compute. Size for VRAM capacity first, then optimize for bandwidth or compute based on whether your workload is decode- or prefill-heavy.

Frequently asked questions

Why is GPU memory bandwidth more important than compute for inference?

During decode, an LLM generates one token at a time and must read the entire model's weights from VRAM for every single token. This makes decode memory-bandwidth-bound: the GPU spends most of its time waiting on memory, not computing. Prefill (processing the prompt) is compute-bound, but decode dominates most interactive workloads.

How much GPU memory do I need for inference?

You need VRAM for model weights (parameters × bytes per parameter) plus the KV cache (which grows with sequence length and batch size) plus ~10-20% overhead. Use our VRAM Calculator to get an exact figure for your model.

Can I run LLM inference without a GPU?

Yes, but it's much slower. CPUs lack the parallel throughput and memory bandwidth GPUs provide. CPU inference is viable for small models or low-throughput use cases; for production LLM serving, GPUs (or other accelerators like TPUs) are standard.
