# Hardware: GPUs & Accelerators
## GPU Specs Reference
| GPU | Architecture | Memory (GB) | Mem BW (TB/s) | FP16 TFLOPS (dense) | TDP (W) |
|---|---|---|---|---|---|
| H100 SXM5 | Hopper | 80 | 3.35 | 989 | 700 |
| H200 SXM5 | Hopper | 141 | 4.8 | 989 | 700 |
| B200 SXM6 | Blackwell | 192 | 8.0 | 2250 | 1000 |
| B300 SXM6 | Blackwell Ultra | 288 | 8.0 | 2500 | 1000 |
| A100 SXM4 | Ampere | 80 | 2.0 | 312 | 400 |
| L4 | Ada Lovelace | 24 (GDDR6) | 0.3 | 121 | 72 |
## Memory Hierarchy
| Level | Capacity | Bandwidth | Latency |
|---|---|---|---|
| Registers | ~256 KB/SM | ~100 TB/s | ~1 cycle |
| L1 / Shared Mem | 128–256 KB/SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 50–96 MB | ~7 TB/s | ~200 cycles |
| HBM (VRAM) | 80–288 GB | 3–8 TB/s | ~600 cycles |
| NVLink | multi-GPU | 0.9 TB/s | ~1 μs |
| PCIe 5.0 | CPU-GPU | 0.128 TB/s | ~5 μs |
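One practical use of the HBM row above: for a memory-bound decode step, every weight byte must stream from HBM once per generated token, so model size ÷ HBM bandwidth gives a lower bound on per-token latency. A back-of-envelope sketch (ignores KV-cache reads and assumes weights do not fit in cache; the 70B/H100 numbers are illustrative):

```python
# Lower bound on per-token decode latency for a memory-bandwidth-bound LLM.
def min_token_latency_s(model_bytes: float, hbm_bw_bytes_per_s: float) -> float:
    """Time to stream all weights from HBM once (one decode step)."""
    return model_bytes / hbm_bw_bytes_per_s

# 70B params in FP16 (2 bytes each) on an H100 (3.35 TB/s from the table):
latency = min_token_latency_s(70e9 * 2, 3.35e12)
print(f"{latency * 1e3:.1f} ms/token")  # ~41.8 ms -> at most ~24 tokens/s per GPU
```

This is why bandwidth, not peak FLOPS, usually dictates single-stream decode throughput.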
## Key Hardware Concepts
- **Arithmetic intensity (AI)**: FLOPs per byte moved. If AI is below the GPU's ops:byte ratio, the workload is memory-bandwidth-bound.
- **Roofline model**: attainable performance = min(peak FLOPS, bandwidth × AI). Identifies whether a kernel is compute- or bandwidth-limited.
- **NVLink**: NVIDIA's high-speed GPU interconnect; 900 GB/s per GPU on NVLink 4.0. Used for tensor parallelism.
- **NVSwitch**: all-to-all NVLink switching fabric in DGX/HGX nodes; enables a full-bandwidth GPU mesh.
- **MIG (Multi-Instance GPU)**: partitions one H100 into up to seven isolated GPU instances. Well suited to serving small models.
## Sizing Checklist
- Calculate model VRAM: params × bytes_per_param (FP16=2, INT8=1, FP8=1, INT4=0.5)
- Add KV cache: 2 × layers × kv_heads × d_head × context_len × batch × dtype_bytes (2× covers K and V; use KV heads, not query heads, for GQA models)
- Add activation memory: ~1 GB overhead per GPU
- Check NVLink topology if using tensor parallelism
- Verify PCIe bandwidth for CPU-GPU data transfers
- Consider MIG for small models to reduce cost
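The first three checklist steps can be sketched as a small calculator. The model shape below (70B params, 80 layers, 8 KV heads, d_head 128) is illustrative, not a specific product:

```python
# VRAM sizing sketch following the checklist above.
def model_vram_gb(params: float, bytes_per_param: float) -> float:
    """Weight memory: params x bytes_per_param (FP16=2, INT8=1, FP8=1, INT4=0.5)."""
    return params * bytes_per_param / 1e9

def kv_cache_gb(layers, kv_heads, d_head, context_len, batch, dtype_bytes):
    """KV cache: 2x (K and V) per layer; kv_heads handles GQA."""
    return 2 * layers * kv_heads * d_head * context_len * batch * dtype_bytes / 1e9

# Hypothetical 70B model in FP16, 8k context, batch 4:
weights = model_vram_gb(70e9, 2)           # 140.0 GB
kv = kv_cache_gb(80, 8, 128, 8192, 4, 2)   # ~10.7 GB
total = weights + kv + 1.0                 # + ~1 GB activation overhead
print(round(total, 1))                     # ~151.7 GB -> needs >1 GPU, or quantization
```

At 151.7 GB this model overflows a single H100 (80 GB) but fits on one B200 (192 GB), or on two H100s with tensor parallelism (hence the NVLink-topology check in the list).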