
Hardware: GPUs & Accelerators

4 sections · Quick reference card

GPU Specs Reference

| GPU | Architecture | HBM (GB) | Mem BW (TB/s) | FP16 TFLOPS | TDP (W) |
|---|---|---|---|---|---|
| H100 SXM5 | Hopper | 80 | 3.35 | 989 | 700 |
| H200 SXM5 | Hopper | 141 | 4.8 | 989 | 700 |
| B200 SXM6 | Blackwell | 192 | 8.0 | 2250 | 1000 |
| B300 SXM6 | Blackwell Ultra | 288 | 8.0 | 2500 | 1000 |
| A100 SXM4 | Ampere | 80 | 2.0 | 312 | 400 |
| L4 | Ada Lovelace | 24 | 0.3 | 121 | 72 |

Memory Hierarchy

| Level | Capacity | Bandwidth | Latency |
|---|---|---|---|
| Registers | ~256 KB/SM | ~100 TB/s | ~1 cycle |
| L1 / Shared Mem | 128–256 KB/SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 50–96 MB | ~7 TB/s | ~200 cycles |
| HBM (VRAM) | 80–288 GB | 3–8 TB/s | ~600 cycles |
| NVLink | multi-GPU | 0.9 TB/s | ~1 μs |
| PCIe 5.0 | CPU–GPU | 0.128 TB/s | ~5 μs |
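The bandwidth gaps in the table compound quickly. As a rough illustration, here is the time to move 80 GB of FP16 weights (one H100's worth of HBM) across three of the links above, assuming sustained peak bandwidth, which real transfers rarely reach:

```python
GIGA = 1e9

# Bandwidth figures (TB/s) taken from the memory-hierarchy table above.
links_tb_per_s = {
    "HBM (H100)": 3.35,
    "NVLink 4.0": 0.9,
    "PCIe 5.0 x16": 0.128,
}

weights_bytes = 80 * GIGA  # 80 GB of FP16 model weights

for name, bw in links_tb_per_s.items():
    seconds = weights_bytes / (bw * 1e12)
    print(f"{name:14s} {seconds * 1000:8.1f} ms")
```

Reading weights from HBM takes tens of milliseconds, while the same bytes over PCIe take over half a second: this is why weights should stay resident in VRAM and why CPU offloading is so costly.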

Key Hardware Concepts

Arithmetic Intensity
FLOPs per byte moved from memory. If a kernel's AI is below the GPU's ops:byte ratio (peak FLOPS ÷ memory bandwidth), the kernel is memory-bandwidth-bound.
Roofline model
Performance = min(peak_FLOPS, bandwidth × AI). Identifies whether kernel is compute or BW limited.
NVLink
NVIDIA's high-speed GPU interconnect. 900 GB/s (NVLink 4.0). Used for tensor parallelism.
NVSwitch
All-to-all NVLink switching fabric in DGX/HGX nodes. Enables full-bandwidth GPU mesh.
MIG
Multi-Instance GPU. Partition one H100 into up to 7 isolated GPU instances. Great for small models.
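The roofline model above can be sketched in a few lines. The defaults below are the H100 SXM5 FP16 figures from the specs table; attainable throughput is the minimum of the compute roof and the bandwidth roof:

```python
def roofline_tflops(ai_flops_per_byte, peak_tflops=989.0, mem_bw_tbs=3.35):
    """Attainable TFLOPS = min(compute roof, bandwidth roof).

    ai_flops_per_byte: arithmetic intensity (FLOPs per byte moved from HBM).
    Defaults are the H100 SXM5 FP16 numbers from the specs table.
    """
    # TB/s * FLOP/byte = TFLOP/s, so units line up directly.
    bandwidth_roof = mem_bw_tbs * ai_flops_per_byte
    return min(peak_tflops, bandwidth_roof)

# Ridge point: the AI at which a kernel stops being bandwidth-bound.
ridge = 989.0 / 3.35  # ~295 FLOP/byte on H100

print(roofline_tflops(1))    # elementwise-style kernel: 3.35 TFLOPS, BW-bound
print(roofline_tflops(400))  # large matmul: hits the 989 TFLOPS compute roof
```

Kernels with AI below the ridge point (~295 FLOP/byte here) gain nothing from more FLOPS; only higher memory bandwidth or better data reuse (fusion, tiling) helps them.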

Sizing Checklist

  • Calculate model VRAM: params × bytes_per_param (FP16=2, INT8=1, FP8=1, INT4=0.5)
  • Add KV cache: 2 (K and V) × layers × kv_heads × d_head × context_len × batch × dtype_bytes
  • Add activation memory: ~1 GB overhead per GPU
  • Check NVLink topology if using tensor parallelism
  • Verify PCIe bandwidth for CPU-GPU data transfers
  • Consider MIG for small models to reduce cost
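The first three checklist items can be folded into a quick calculator. A minimal sketch; the model shape in the example (a hypothetical 70B, Llama-style config with GQA) is an illustrative assumption, not taken from the tables above:

```python
def vram_gb(params_b, layers, kv_heads, d_head, context_len, batch,
            weight_bytes=2, kv_bytes=2, overhead_gb=1.0):
    """Estimate serving VRAM (GB) per the sizing checklist.

    params_b:     parameter count in billions
    weight_bytes: 2 for FP16, 1 for INT8/FP8, 0.5 for INT4
    kv cache:     2 (K and V) * layers * kv_heads * d_head
                  * context_len * batch * kv_bytes
    overhead_gb:  ~1 GB activation/runtime overhead per GPU
    """
    weights = params_b * 1e9 * weight_bytes
    kv = 2 * layers * kv_heads * d_head * context_len * batch * kv_bytes
    return (weights + kv) / 1e9 + overhead_gb

# Hypothetical 70B config: 80 layers, 8 KV heads (GQA), d_head=128,
# 8K context, batch 4, FP16 weights and KV cache.
print(vram_gb(70, layers=80, kv_heads=8, d_head=128,
              context_len=8192, batch=4))
```

The example comes out near 152 GB, well past one H100's 80 GB of HBM, so it needs at least two GPUs with tensor parallelism (check the NVLink topology) or quantized weights.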