Most inference engineers use cloud GPUs. Large enterprises and governments run on-premise deployments, but cloud-based GPUs offer the flexibility and access that fast-growing AI products need to scale.
Even so, navigating the hardware landscape is complex. From variations among cloud providers to NVIDIA's naming conventions, there are many nuances in selecting the right accelerator.
GPU Architecture
GPUs are throughput machines. Where CPUs are great at complex sequential execution, GPUs are designed for simple, massively parallel workloads. Given that AI model inference is a series of vector and matrix multiplications, GPUs are a natural fit.
GPUs have Streaming Multiprocessors (SMs), and each SM contains multiple cores of three types:
- CUDA Core: Operates on individual numbers (scalars)
- Tensor Core: Operates on vectors and matrices
- Special Function Unit (SFU): Accelerates operations like sin, cos, log
When measuring GPU compute for inference, focus on Tensor Core compute: Tensor Cores execute the Matrix Multiply and Accumulate (MMA) instructions that are foundational to inference.
Compute is measured in FLOPS (floating point operations per second). When reading spec sheets, note two Tensor Core measurements:
- Dense: Raw floating-point operations if every element is used
- Sparse: With 2:4 structured sparsity (two of every four consecutive values are zero), Tensor Cores skip the multiplications by zero, doubling throughput
By default, inference is dense, so ensure you're looking at FLOPS without sparsity. FLOPS generally double with each halving of precision.
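As a sketch of how these spec-sheet numbers relate, here is a rough calculator. The exact 2x scaling factors are the general rule described above, and the 989.4 TFLOPS FP16 dense figure for the H100 SXM is an assumed input; real hardware can deviate from both.

```python
# Rough spec-sheet arithmetic: dense vs. sparse FLOPS across precisions.
# Assumes throughput doubles with each halving of precision and that
# 2:4 sparsity doubles the headline number -- estimates, not guarantees.

def dense_flops(fp16_dense_tflops: float, bits: int) -> float:
    """Estimate dense Tensor Core TFLOPS at a given precision,
    assuming throughput doubles each time precision halves from FP16."""
    return fp16_dense_tflops * (16 / bits)

def sparse_flops(dense_tflops: float) -> float:
    """2:4 structured sparsity lets Tensor Cores skip half the multiplies,
    doubling the number printed on spec sheets."""
    return dense_tflops * 2

# Assumed input: ~989.4 dense FP16 TFLOPS for an H100 SXM.
fp8_dense = dense_flops(989.4, bits=8)   # ~1979 TFLOPS dense FP8
fp8_sparse = sparse_flops(fp8_dense)     # ~3958 TFLOPS, the "with sparsity" figure
```

This is why the same GPU can appear with very different FLOPS numbers in marketing material: precision and sparsity each double the headline figure.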
Key Takeaway
Compute is the bottleneck for LLM prefill and for image/video generation. If selecting hardware for these use cases, pick the accelerator with more FLOPS.
Memory and Caches
GPUs contain high-speed onboard memory called VRAM, implemented on modern datacenter GPUs as high-bandwidth memory (HBM3, HBM3e, or HBM4). GPUs have dozens or hundreds of gigabytes of VRAM.
Any chip draws on two types of memory:
- DRAM (Dynamic RAM): General-purpose off-chip memory (gigabytes) — VRAM is a type of DRAM
- SRAM (Static RAM): Faster, more expensive, on-chip memory (kilobytes/megabytes)
GPUs have three levels of cache:
| Cache | Scope | Example (H100) |
|---|---|---|
| L0 | Instruction cache per SM partition | — |
| L1 | Shared memory per SM | 256 KB per SM |
| L2 | Global cache across all SMs | 50 MB total |
VRAM bandwidth measures the peak transfer rate between GPU cores and VRAM. This determines how quickly VRAM can supply data into the cache hierarchy.
The total VRAM limits the model size you can load. VRAM should hold the model weights plus at least 50 percent headroom for KV cache (more for long context, high batch sizes, or video generation models).
If there isn't enough VRAM for the weights, loading the model will fail with an OOM (out of memory) error. If there isn't enough headroom, inference will be slow or crash.
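The sizing rule above can be sketched as a short calculation. The 50 percent headroom factor comes from the guideline above; the byte counts per parameter (FP16 = 2, FP8 = 1, 4-bit = 0.5) are standard but the function names are illustrative.

```python
# Minimal VRAM sizing sketch: weights plus at least 50% headroom
# for KV cache and activations, per the rule of thumb above.

def vram_needed_gb(params_billions: float, bytes_per_param: float,
                   headroom: float = 0.5) -> float:
    weights_gb = params_billions * bytes_per_param  # 1B params * 1 byte ~= 1 GB
    return weights_gb * (1 + headroom)

def fits(params_billions: float, bytes_per_param: float, vram_gb: float) -> bool:
    return vram_needed_gb(params_billions, bytes_per_param) <= vram_gb

# A 70B model in FP8 needs ~105 GB with headroom: too big for an
# 80 GB H100, fine on a 141 GB H200.
print(fits(70, 1.0, 80))   # False
print(fits(70, 1.0, 141))  # True
```

Long context, high batch sizes, or video generation would push the headroom factor well above 0.5.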
Memory bandwidth is the bottleneck for LLM decode at low to medium batch sizes. If you want more tokens per second, select the accelerator with higher memory bandwidth (e.g., H200 instead of H100).
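A back-of-envelope version of this bottleneck: at low batch sizes, every generated token must stream all model weights from VRAM, so per-sequence decode speed is roughly bandwidth divided by model size. This is an upper bound that ignores KV cache traffic and batching, not a benchmark.

```python
# Memory-bound decode ceiling: tokens/sec ~= VRAM bandwidth / model size.
# Ignores KV cache reads, batching, and kernel efficiency -- an upper bound.

def decode_tokens_per_sec(bandwidth_tb_s: float, model_gb: float) -> float:
    return bandwidth_tb_s * 1000 / model_gb  # TB/s -> GB/s, then divide by GB

# A 70B model quantized to FP8 (~70 GB of weights):
h100 = decode_tokens_per_sec(3.35, 70)  # ~48 tokens/sec ceiling
h200 = decode_tokens_per_sec(4.8, 70)   # ~69 tokens/sec ceiling
```

The ratio between the two ceilings tracks the bandwidth ratio, which is why the H200's extra bandwidth translates directly into decode speed.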
GPU Architecture Generations
Every one to two years, NVIDIA releases a new GPU architecture. GPU names like B200 have two parts:
- Letter: Signifies the architecture generation (e.g., B = Blackwell)
- Number: Identifies models within the generation (bigger = more powerful)
Since 1998, NVIDIA has named architectures for prominent scientists. Inference engineers generally work within the three to five most recent generations.
Hopper GPUs
| GPU | FP8 Compute (dense) | Memory | Bandwidth |
|---|---|---|---|
| H100 | 1,979 teraFLOPS | 80 GB | 3.35 TB/s |
| H200 | 1,979 teraFLOPS | 141 GB | 4.8 TB/s |
Named for Rear Admiral Grace Hopper. First released March 2022. Key features:
- FP8 support: FP8 Tensor Cores are 2x faster than FP16 and halve the memory bandwidth needed per value
- Fourth-generation Tensor Cores with more and faster SMs
- FlashAttention 3 takes advantage of Hopper's asynchronous data transfer
- The most widely used inference accelerators with industry-wide support and highly optimized kernels
Ada Lovelace GPUs
| GPU | FP8 Compute (dense) | Memory | Bandwidth |
|---|---|---|---|
| L4 | 242 teraFLOPS | 24 GB | 300 GB/s |
| L40 | 362 teraFLOPS | 48 GB | 864 GB/s |
Named for the first computer programmer. More graphics-oriented than Hopper. Does not support NVLink interconnect — a major limitation for multi-GPU parallelism.
L4 GPUs are a cheap way to run small models (text embeddings, computer vision). L40 GPUs generally aren't a great choice for inference; MIG slices of larger GPUs often offer better value.
Blackwell GPUs
| GPU | FP8 Compute (dense) | Memory | Bandwidth |
|---|---|---|---|
| B200 | ~5 petaFLOPS | 192 GB | Up to 8 TB/s |
| B300 | ~5 petaFLOPS | 288 GB | Up to 8 TB/s |
Named for mathematician David Blackwell. First released November 2024. Key features:
- FP4 support plus microscaling formats (MXFP8, MXFP4, NVFP4) for better quality retention
- Updated asynchronous programming features for tiling loads and writes
- FlashAttention 4 relies on asynchronous pipelines
- The new gold standard for inference — highest performance for LLMs and video generation
Rubin GPUs
The Rubin architecture (named for astronomer Vera Rubin) will launch in 2026 with:
- HBM4 for higher memory bandwidth — benefits memory-bound tasks like LLM decode
- CPX: A separate chip built for compute-bound tasks like LLM prefill
- Part of NVIDIA's rack-scale systems for high-volume inference
After Rubin, NVIDIA will release Feynman in 2028.
Grace and Vera CPUs
NVIDIA offers ARM-based CPUs integrated with their GPUs on superchips like the GH200 and GB200. Grace CPUs use NVLink-C2C (chip-to-chip) for up to 900 GB/s of bidirectional bandwidth between CPU and GPU memory, several times faster than PCIe.
This matters for inference setups that offload LoRA weights and KV caches from previous calls to CPU memory. For the Rubin architecture, the Vera CPU replaces Grace.
Instances
The atomic unit of GPU allocation on the cloud is an instance — a virtual machine including:
- GPUs (device): One or more GPUs for inference
- CPUs (host): General-purpose compute
- Memory: RAM for CPU operations
- Storage: Disk for loading and storing files
- Networking: Physical network connections
- Interconnect: GPU-to-GPU and node-to-node connections
Key Takeaway
When provisioning instances, understand exactly what you're getting. Any component — not just the GPU — could present a bottleneck or failure point during inference.
Multi-GPU Instances
When a model is too big for a single GPU, or when multiple GPUs improve performance, the standard unit is a node containing eight GPUs connected via:
- NVLink: One-to-one communication between GPUs (up to 1800 GB/s on Blackwell, 900 GB/s on Hopper)
- NVSwitch: All-to-all communication on top of NVLink for coordination among all GPUs in a node
For more than eight GPUs, InfiniBand provides node-to-node interconnect (up to 400 Gb/s per NIC). NVIDIA's GB200 NVL72 system combines 72 Blackwell GPUs and 36 Grace CPUs in a full rack.
When working with multi-GPU systems, keep relative bandwidths in mind — NVLink is an order of magnitude faster than InfiniBand.
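To make that order-of-magnitude gap concrete, here is an illustrative transfer-time comparison using the peak figures quoted above (Hopper NVLink at 900 GB/s; InfiniBand at 400 Gb/s, i.e. 50 GB/s, per NIC). Real transfers see lower effective bandwidth.

```python
# Time to move a tensor over NVLink (within a node) vs. InfiniBand
# (across nodes), using peak bandwidths -- effective rates are lower.

def transfer_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    return size_gb / bandwidth_gb_s * 1000

tensor_gb = 1.0  # e.g., a shard of activations or KV cache
nvlink_ms = transfer_ms(tensor_gb, 900)  # ~1.1 ms within a node
ib_ms = transfer_ms(tensor_gb, 50)       # ~20 ms across nodes, ~18x slower
```

This gap is why parallelism strategies that communicate heavily (like tensor parallelism) are usually kept within a single NVLink-connected node.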
Multi-Instance GPUs
Sometimes the GPU is too big for the model. Multi-instance GPU (MIG) is a hardware-level capability in A100, H100, H200, and B200 GPUs — they can be split into as many as seven pieces, each receiving a slice of compute, memory, CPU, and networking.
For example, an H100 MIG with three slices has about 3/7 of available compute and up to half of total VRAM (40 GB). For small models like Orpheus TTS (3B parameters), two MIG instances can be more efficient than a full GPU.
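As an illustrative sketch of the slice math, assuming the H100 figures above (1,979 dense FP8 TFLOPS, compute divided into sevenths) and treating the compute split as linear:

```python
# MIG slice resources on an H100, approximated from the fractions above.
# NVIDIA profile names follow <compute slices>g.<GiB>gb, e.g. "3g.40gb".

def mig_resources(compute_slices: int, vram_gb: int,
                  total_fp8_tflops: float = 1979) -> dict:
    return {
        "compute_tflops": total_fp8_tflops * compute_slices / 7,
        "vram_gb": vram_gb,
    }

three_slice = mig_resources(3, 40)  # ~848 FP8 TFLOPS, 40 GB: the example above
one_slice = mig_resources(1, 10)    # smallest H100 MIG profile (1g.10gb)
```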
Other Datacenter Accelerators
NVIDIA is far from the only company making inference hardware. Notable alternatives:
| Company | Product | Edge |
|---|---|---|
| AMD | MI350 GPU | Competitive specs on AMD's ROCm software stack |
| AWS | Inferentia / Trainium | Purpose-built chips on AWS |
| Cerebras | WSE-3 | Wafer-scale chip with ultra-high memory bandwidth |
| Etched | Sohu | ASIC for the transformer architecture |
| Google | TPU | AI-specific ASIC for inference and training |
| Groq | LPU | SRAM-based for high memory bandwidth |
| Qualcomm | Cloud AI 100 Ultra | Power-efficient accelerators built on mobile NPU designs |
Every competitor is betting on a specific edge: memory bandwidth (Cerebras, Groq), power efficiency (Furiosa, Qualcomm), or platform integration (Amazon, Google).
Shared challenges: rebuilding the software stack without CUDA, manufacturing complexity, and distribution.
Local Inference
Local inference (also called edge, client-side, or on-device inference) means running models directly on the end user's device.
Advantages: Zero network latency, independence from internet, improved privacy, free for the developer.
Limitations: Weaker hardware capabilities, thermal constraints, fragmented support matrix, battery drain.
Desktop Inference
Increasingly, Apple is the leader in desktop inference. Apple's M-series CPUs and GPUs draw from a single unified memory pool, which avoids copying data between separate CPU and GPU memory.
The M4 Max supports up to 128 GB of unified memory, enough to run quantized models up to 70B parameters, while MacBook Pros and Mac Studios are consumer-priced. Projects like llama.cpp and MLX have optimized LLM inference on Apple Silicon.
Mobile Inference
Mobile inference runs smaller models (1-3B parameters) on smartphones. Apple Intelligence uses on-device models for tasks like Smart Reply and text summarization. Qualcomm and MediaTek chips support inference through their NPUs.
Key Takeaway
Local inference is turning the corner from experimentation to production, with a strong ecosystem across hardware and software and a vibrant community closely affiliated with the world of open models.