Inference: Three-Layer Framework
The Three Layers
- Runtime
- Executing the model: kernels, attention, KV cache, batching, quantization. Happens on the GPU.
- Infrastructure
- Servers, networking, autoscaling, load balancing, multi-cloud capacity. Happens on machines.
- Tooling
- APIs, SDKs, observability, routing, caching at application level. Happens in software.
Key Terms
- Inference
- Running a trained model on new inputs to produce outputs — not training.
- Prefill
- Processing the entire input prompt in parallel. Compute-bound. One forward pass.
- Decode
- Generating output tokens one at a time, each conditioned on all previous tokens. Memory-bandwidth-bound.
- TTFT
- Time to First Token — measures prefill latency. Critical for interactive apps.
- TPOT
- Time Per Output Token — measures decode speed. Drives streaming UX (see the timing sketch after this list).
- Throughput
- Tokens generated per second across all concurrent requests. Measures hardware utilization.
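Both latency metrics fall out of simple timestamps around a streaming response. A minimal sketch, assuming an OpenAI-compatible streaming endpoint; the base URL and model name are placeholders, and counting stream chunks is only a rough proxy for tokens:

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; point these at any OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
n_tokens = 0

stream = client.chat.completions.create(
    model="my-model",  # placeholder
    messages=[{"role": "user", "content": "Explain KV caching in one line."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # prefill ends here
        n_tokens += 1  # chunks approximate tokens

end = time.perf_counter()
ttft = first_token_at - start                         # prefill latency
tpot = (end - first_token_at) / max(n_tokens - 1, 1)  # avg decode step
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```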
Runtime / Infra / Tooling Split
| Layer | Concern | Example Tools |
|---|---|---|
| Runtime | Kernel efficiency, batching | vLLM, TensorRT-LLM, SGLang |
| Infrastructure | Scaling, placement, cost | Kubernetes, Karpenter, Ray |
| Tooling | Routing, caching, evals | LiteLLM, Prometheus, Langfuse |
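At the runtime layer, an engine like vLLM keeps the table's concerns (paged KV cache, continuous batching, kernels) inside the engine and exposes a small API. A minimal offline sketch, assuming vLLM is installed; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# The engine owns runtime concerns: KV-cache paging, batching, kernel dispatch.
llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

# Prompts submitted together are batched by the engine automatically.
outputs = llm.generate(["What is prefill?", "What is decode?"], params)
for out in outputs:
    print(out.outputs[0].text)
```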
Inference Lifecycle Steps
1. Request arrives at load balancer
2. Router selects backend instance
3. KV cache checked for prefix reuse
4. Prefill: prompt tokens processed in parallel
5. Decode: tokens generated auto-regressively
6. Response streamed back to client
7. Metrics emitted to observability stack
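The same steps, condensed into a Python sketch. Every object and method here (`backends`, `prefix_cache`, `prefill`, `decode`, `metrics`) is a hypothetical stand-in for a real component, not any library's API:

```python
import time

def handle_request(request, backends, prefix_cache, metrics):
    """Hypothetical end-to-end path for one inference request."""
    t0 = time.perf_counter()

    # Steps 1-2: load balancer / router picks an instance (e.g., least loaded).
    backend = min(backends, key=lambda b: b.queue_depth)

    # Step 3: reuse cached KV blocks for any previously seen prompt prefix.
    cached_kv, remaining_prompt = prefix_cache.longest_prefix(request.prompt)

    # Step 4: prefill processes the uncached prompt tokens in one parallel pass.
    kv_cache = backend.prefill(remaining_prompt, initial_kv=cached_kv)
    metrics.observe("ttft_s", time.perf_counter() - t0)

    # Steps 5-6: decode generates tokens one at a time, streaming each out.
    for token in backend.decode(kv_cache, max_tokens=request.max_tokens):
        yield token

    # Step 7: emit request-level metrics to the observability stack.
    metrics.observe("e2e_latency_s", time.perf_counter() - t0)
```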