Software: Stack & Inference Engines
4 sections · Quick reference card
Software Stack Layers
| Layer | What It Does | Examples |
|---|---|---|
| Application | API, routing, auth, caching | FastAPI, LiteLLM, nginx |
| Inference Engine | Batching, KV cache mgmt, scheduling | vLLM, SGLang, TRT-LLM |
| DL Framework | Tensor ops, autograd, model loading | PyTorch, JAX |
| Kernel Libraries | Optimized GPU ops | cuDNN, CUTLASS, FlashAttention |
| GPU Runtime | Device management, memory, streams | CUDA, ROCm, OpenCL |
| Driver / Hardware | Physical GPU execution | NVIDIA Driver, H100, A100 |
Inference Engine Comparison
| Engine | Strengths | Best For |
|---|---|---|
| vLLM | PagedAttention, OpenAI-compatible API, broad model support | General serving, research |
| SGLang | RadixAttention, structured generation, low latency | Agentic workloads, constrained gen |
| TensorRT-LLM | NVIDIA-optimized kernels, highest throughput | Production on NVIDIA hardware |
| llama.cpp | CPU+GPU, GGUF quantization, low memory | Edge, local, consumer hardware |
| MLC LLM | Multi-platform (CUDA/Metal/WebGPU) | Cross-platform deployment |
Key Concepts
- PagedAttention
- KV cache stored in non-contiguous memory pages (like OS virtual memory). Eliminates fragmentation. Enables high batch sizes. Core of vLLM.
- Continuous batching
- New requests inserted into batch as soon as a slot frees. Eliminates padding waste. Also called iteration-level scheduling.
- CUDA graphs
- Capture a GPU kernel sequence as a graph, replay it with minimal CPU launch overhead. Reduces latency for static (fixed batch/sequence) shapes.
- FlashAttention
- IO-aware attention kernel. Fuses softmax + matmul. Avoids materializing N×N attention matrix. 2-4× faster, 5-20× less memory.
- GGUF
- File format for quantized models (successor to GGML). Used by llama.cpp. Stores model weights + metadata in one file.
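The PagedAttention idea above can be sketched with a toy page allocator. This is an illustration only, not the vLLM API: the `PagedKVCache` class, its method names, and the 4-token block size are all invented for the example (vLLM uses larger blocks, e.g. 16 tokens).

```python
BLOCK_SIZE = 4  # tokens per KV page; hypothetical, real engines use larger blocks

class PagedKVCache:
    """Toy sketch: KV slots live in fixed-size pages drawn from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free page pool
        self.tables = {}                      # seq_id -> list of page ids (block table)
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve a KV slot for one new token, paging in a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current page full, or first token
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        """Return a finished sequence's pages to the pool for immediate reuse."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=8)
for _ in range(6):          # sequence A: 6 tokens -> occupies 2 pages
    cache.append_token("A")
for _ in range(3):          # sequence B: 3 tokens -> 1 page
    cache.append_token("B")
cache.free_seq("A")         # A's pages go straight back to the pool
```

Because pages are non-contiguous and pool-allocated, a finished sequence's memory is reusable by any new request, which is what eliminates fragmentation.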
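Continuous batching can likewise be shown with a toy simulation. The `serve` function below is hypothetical (not any engine's scheduler); it assumes a fixed batch capacity and a known remaining-token count per request.

```python
from collections import deque

def serve(requests, capacity):
    """Toy iteration-level scheduler.
    requests: list of (req_id, tokens_to_generate). Returns completion order."""
    queue = deque(requests)
    running = {}            # req_id -> tokens still to generate
    finished = []
    while queue or running:
        # Admit new requests the moment a batch slot frees, rather than
        # waiting for the whole batch to drain (static batching).
        while queue and len(running) < capacity:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode iteration: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished.append(rid)
    return finished

order = serve([("a", 2), ("b", 5), ("c", 1)], capacity=2)
```

Note that "c" joins as soon as "a" finishes and completes before the long request "b", even though "b" was admitted first; no slot ever sits idle padding out a batch.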
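The trick that lets FlashAttention avoid the N×N matrix is online (streaming) softmax: process scores in tiles, keeping a running max and rescaling the accumulators when it changes. A pure-Python sketch with scalar values, names invented for illustration:

```python
import math

def online_attention(score_chunks, value_chunks):
    """Streaming softmax-weighted average over chunks of (score, value) pairs.
    Only three scalars of state; no full score row is ever materialized."""
    m, denom, acc = float("-inf"), 0.0, 0.0
    for scores, values in zip(score_chunks, value_chunks):
        m_new = max(m, max(scores))
        c = math.exp(m - m_new)      # rescale old accumulators to the new max
        denom *= c
        acc *= c
        for x, v in zip(scores, values):
            e = math.exp(x - m_new)
            denom += e
            acc += e * v
        m = m_new
    return acc / denom

scores = [1.0, 3.0, 2.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0]

# Reference: ordinary (materialized) softmax-weighted average.
mx = max(scores)
full = (sum(math.exp(x - mx) * v for x, v in zip(scores, values))
        / sum(math.exp(x - mx) for x in scores))

# Tiled computation over two chunks gives the same answer.
tiled = online_attention([scores[:2], scores[2:]], [values[:2], values[2:]])
```

The real kernel applies the same rescaling to output-row accumulators in SRAM tiles; the algebra is identical.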
Model Format Checklist
- Safetensors: prefer over pickle-based .bin (no arbitrary-code-execution risk, mmap-friendly loading)
- FP16/BF16 baseline: most inference engines expect half-precision weights by default
- GGUF: use for llama.cpp and consumer deployment
- TensorRT engine: pre-compile for fixed batch/seq shapes
- Verify config.json has correct rope_scaling for long context