
Software: Stack & Inference Engines


Software Stack Layers

| Layer | What It Does | Examples |
|---|---|---|
| Application | API, routing, auth, caching | FastAPI, LiteLLM, nginx |
| Inference Engine | Batching, KV cache management, scheduling | vLLM, SGLang, TRT-LLM |
| DL Framework | Tensor ops, autograd, model loading | PyTorch, JAX |
| Kernel Libraries | Optimized GPU ops | cuDNN, CUTLASS, FlashAttention |
| GPU Runtime | GPU management, memory, streams | CUDA, ROCm, OpenCL |
| Driver / Hardware | Physical GPU execution | NVIDIA Driver, H100, A100 |

Inference Engine Comparison

| Engine | Strengths | Best For |
|---|---|---|
| vLLM | PagedAttention, OpenAI-compatible API, broad model support | General serving, research |
| SGLang | RadixAttention, structured generation, low latency | Agentic workloads, constrained gen |
| TensorRT-LLM | NVIDIA-optimized kernels, highest throughput | Production on NVIDIA hardware |
| llama.cpp | CPU+GPU, GGUF quantization, low memory | Edge, local, consumer hardware |
| MLC LLM | Multi-platform (CUDA/Metal/WebGPU) | Cross-platform deployment |

Key Concepts

PagedAttention
KV cache stored in non-contiguous memory pages (like OS virtual memory). Eliminates fragmentation. Enables high batch sizes. Core of vLLM.
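The paging idea can be sketched in a few lines of bookkeeping. This is an illustrative toy, not vLLM's actual implementation: a block table maps each sequence's logical token positions to fixed-size physical blocks drawn from a shared free pool, so no sequence needs contiguous memory and finished sequences return their pages instantly.

```python
BLOCK_SIZE = 4  # tokens per physical block (illustrative; vLLM's default differs)

class PagedKVCache:
    """Toy page-table bookkeeping for a paged KV cache."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Reserve a slot for the token at logical position `pos`; return its physical slot."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):       # current block full (or first token)?
            table.append(self.free_blocks.pop())  # grab any free page, contiguity not needed
        block = table[pos // BLOCK_SIZE]
        return block * BLOCK_SIZE + pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token("req-1", p) for p in range(6)]
# 6 tokens span 2 blocks; the blocks need not be adjacent in physical memory
cache.free("req-1")
```

Because blocks are allocated on demand, memory is reserved per block rather than per worst-case sequence length, which is what lets vLLM pack far more sequences into a batch.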
Continuous batching
New requests inserted into batch as soon as a slot frees. Eliminates padding waste. Also called iteration-level scheduling.
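A toy scheduler makes the "slot frees, request enters" behavior concrete. The request names and token counts below are fabricated for illustration:

```python
from collections import deque

def run(requests, max_batch):
    """Simulate iteration-level scheduling.

    requests: list of (req_id, tokens_to_generate).
    Returns the batch contents at each decode step.
    """
    queue = deque(requests)
    active = {}  # req_id -> tokens still to generate
    trace = []
    while queue or active:
        # Admit queued requests into any free slots (the "continuous" part).
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        trace.append(sorted(active))
        # One decode iteration: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot frees this very iteration
    return trace

steps = run([("A", 3), ("B", 1), ("C", 2)], max_batch=2)
# → [['A', 'B'], ['A', 'C'], ['A', 'C']]
```

Note "C" enters the moment "B" finishes, rather than waiting for the whole batch to drain; every decode step runs at full batch width with no padding.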
CUDA graphs
Capture GPU kernel sequence as a graph, replay with minimal CPU overhead. Reduces latency for fixed batch sizes.
FlashAttention
IO-aware attention kernel. Fuses softmax + matmul. Avoids materializing N×N attention matrix. 2-4× faster, 5-20× less memory.
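The trick that lets FlashAttention avoid the N×N matrix is the online (streaming) softmax: scores are consumed in tiles while running max/sum statistics are maintained, and earlier partial results are rescaled as the max grows. A scalar pure-Python sketch of that recurrence (real kernels do this per tile on GPU SRAM):

```python
import math

def streaming_attention(scores, values):
    """Softmax-weighted sum of `values`, consuming one score at a time."""
    m = float("-inf")  # running max (numerical stability)
    s = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running weighted sum of values
    for score, v in zip(scores, values):
        m_new = max(m, score)
        scale = math.exp(m - m_new)        # rescale old partials if the max moved
        w = math.exp(score - m_new)
        s = s * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / s

def naive_attention(scores, values):
    """Two-pass reference: materializes all weights at once."""
    m = max(scores)
    w = [math.exp(x - m) for x in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

out = streaming_attention([0.1, 2.0, -1.0], [1.0, 2.0, 3.0])
# matches naive_attention on the same inputs to floating-point precision
```

The streaming version never holds all the weights simultaneously, which is why the full attention matrix never has to be written to (slow) GPU HBM.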
GGUF
File format for quantized models (successor to GGML). Used by llama.cpp. Stores model weights + metadata in one file.
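Per the GGUF spec, a file starts with a small fixed header: 4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count. A minimal sketch that packs a fake header in memory and parses it back (the counts are fabricated example values):

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER = struct.Struct("<4sIQQ")  # magic, version, n_tensors, n_kv (little-endian)

def parse_header(blob):
    """Parse the fixed 24-byte GGUF header from the start of `blob`."""
    magic, version, n_tensors, n_kv = HEADER.unpack_from(blob)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# Fabricated header standing in for the first bytes of a real model file.
blob = HEADER.pack(GGUF_MAGIC, 3, 291, 24)
info = parse_header(blob)
# → {'version': 3, 'n_tensors': 291, 'n_kv': 24}
```

The metadata key/value section that follows the header is what carries architecture, tokenizer, and quantization details, which is why a single `.gguf` file is self-describing.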

Model Format Checklist

  • Safetensors: prefer over pickle-based .bin for security (no arbitrary code execution on load) + mmap support
  • FP16/BF16 baseline: most inference engines load half-precision weights by default
  • GGUF: use for llama.cpp and consumer deployment
  • TensorRT engine: pre-compile for fixed batch/seq shapes
  • Verify config.json has correct rope_scaling for long context
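The last check can be automated before serving. A hedged sketch, assuming the Hugging Face `config.json` conventions (`max_position_embeddings`, and a `rope_scaling` dict keyed by `rope_type` in newer configs or `type` in older ones); the config below is a fabricated example:

```python
import json

def check_rope_scaling(config, needed_ctx):
    """Return 'ok: ...' or 'warn: ...' for serving `needed_ctx`-token requests."""
    max_pos = config.get("max_position_embeddings", 0)
    scaling = config.get("rope_scaling")  # None if absent
    if max_pos >= needed_ctx:
        return "ok: native context covers the workload"
    if scaling is None:
        return "warn: no rope_scaling and native context too short"
    kind = scaling.get("rope_type", scaling.get("type"))
    return f"ok: rope_scaling type={kind}"

config = json.loads("""{
  "max_position_embeddings": 4096,
  "rope_scaling": {"rope_type": "yarn", "factor": 8.0}
}""")
check_rope_scaling(config, needed_ctx=32768)
# → "ok: rope_scaling type=yarn"
```

Running this against the model directory's actual `config.json` catches the common failure mode where a long-context fine-tune ships without its scaling entry and silently degrades past the native window.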