
Inference: Three-Layer Framework

4 sections · Quick reference card

The Three Layers

Runtime
Execution of the model: kernels, attention, KV cache, batching, quantization. Happens on the GPU.
Infrastructure
Servers, networking, autoscaling, load balancing, multi-cloud capacity. Happens on machines.
Tooling
APIs, SDKs, observability, routing, caching at the application level. Happens in software.
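
A minimal sketch of one request crossing all three layers, assuming a vLLM-style OpenAI-compatible server behind a load balancer; the URL, port, and model name below are made up for illustration:

```python
# One request through the three layers (hypothetical addresses and model name).
import requests

LB_URL = "http://inference-lb.internal:8000"   # Infrastructure: load balancer / service endpoint (made up)

def complete(prompt: str) -> str:
    # Tooling: application-level client call; routing, caching, and metrics would also live here.
    resp = requests.post(
        f"{LB_URL}/v1/completions",             # Runtime: vLLM-style OpenAI-compatible server
        json={"model": "my-model", "prompt": prompt, "max_tokens": 64},
        timeout=30,
    )
    resp.raise_for_status()
    # The runtime produced these tokens via prefill + decode on the GPU.
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Explain KV caching in one sentence."))
```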

Key Terms

Inference
Running a trained model on new inputs to produce outputs — not training.
Prefill
Processing the entire input prompt in parallel. Compute-bound. One forward pass.
Decode
Auto-regressive generation, one token per forward pass. Memory-bandwidth-bound.
TTFT
Time to First Token — measures prefill latency. Critical for interactive apps.
TPOT
Time Per Output Token — measures decode speed. Drives streaming UX.
Throughput
Tokens generated per second across all concurrent requests. Measures hardware utilization.
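
These latency metrics reduce to simple arithmetic over per-request timestamps. A minimal sketch with made-up timing values:

```python
# Deriving TTFT, TPOT, and throughput from per-request timestamps.
# The timing values below are made up for illustration.
request_log = [
    # (t_sent, t_first_token, t_last_token, n_output_tokens), times in seconds
    (0.00, 0.35, 2.75, 120),
    (0.10, 0.50, 3.10, 130),
]

total_tokens = 0
window_start = min(r[0] for r in request_log)
window_end = max(r[2] for r in request_log)

for t_sent, t_first, t_last, n_out in request_log:
    ttft = t_first - t_sent                     # prefill (plus queueing) latency
    tpot = (t_last - t_first) / (n_out - 1)     # average decode time per output token
    print(f"TTFT={ttft:.2f}s  TPOT={tpot * 1000:.1f}ms/token")
    total_tokens += n_out

# Throughput: output tokens per second across all concurrent requests.
print(f"Throughput={total_tokens / (window_end - window_start):.1f} tok/s")
```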

Runtime / Infra / Tooling Split

Layer            Concern                        Example Tools
Runtime          Kernel efficiency, batching    vLLM, TensorRT-LLM, SGLang
Infrastructure   Scaling, placement, cost       Kubernetes, Karpenter, Ray
Tooling          Routing, caching, evals        LiteLLM, Prometheus, Langfuse
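
The Tooling row's routing and caching concerns fit in a few lines of application code. A hypothetical sketch, not any specific library's API:

```python
# Hypothetical tooling-layer router with an application-level response cache.
# Backend addresses and the selection policy are illustrative only.
import hashlib

BACKENDS = ["http://vllm-a:8000", "http://vllm-b:8000"]  # runtime instances managed by the infra layer
_cache: dict[str, str] = {}                              # exact-match response cache

def pick_backend(prompt: str) -> str:
    # Deterministic prompt-hash routing keeps identical prefixes on the same
    # instance, which helps KV-cache prefix reuse at the runtime layer.
    h = int(hashlib.sha256(prompt.encode()).hexdigest(), 16)
    return BACKENDS[h % len(BACKENDS)]

def cached_complete(prompt: str, call_backend) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:                                    # cache hit: skip inference entirely
        return _cache[key]
    result = call_backend(pick_backend(prompt), prompt)  # call_backend is a stand-in HTTP client
    _cache[key] = result
    return result
```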

Inference Lifecycle Steps

  • Request arrives at load balancer
  • Router selects backend instance
  • KV cache checked for prefix reuse
  • Prefill: prompt tokens processed in parallel
  • Decode: tokens generated auto-regressively
  • Response streamed back to client
  • Metrics emitted to observability stack
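
A stubbed end-to-end sketch of these steps; every component below is a stand-in, not a real serving framework:

```python
# Stub walk-through of the inference lifecycle; all components are stand-ins.
import time

BACKENDS = ["backend-a", "backend-b"]
PREFIX_CACHE: set[str] = set()        # stands in for the runtime's KV-cache prefix index

def handle_request(prompt: str) -> None:
    t_arrive = time.time()
    backend = BACKENDS[hash(prompt) % len(BACKENDS)]   # load balancer + router pick an instance
    reuse = prompt[:64] in PREFIX_CACHE                # KV cache checked for prefix reuse
    PREFIX_CACHE.add(prompt[:64])

    # Prefill (stub): the whole prompt is processed in one parallel forward pass.
    t_first_token = time.time()

    # Decode (stub): tokens are generated one at a time and streamed back.
    n_tokens = 0
    for token in ["Hello", ",", " world", "!"]:
        print(token, end="", flush=True)
        n_tokens += 1
    print()

    # Metrics emitted to the observability stack (stub: just print them).
    ttft = t_first_token - t_arrive
    print(f"backend={backend} prefix_reuse={reuse} ttft={ttft * 1000:.1f}ms tokens={n_tokens}")

if __name__ == "__main__":
    handle_request("Explain the difference between prefill and decode.")
```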