Prerequisites: Latency, Throughput & Budgets
Quick reference card
Core Formulas
End-to-end latency
latency = TTFT + ((output_tokens − 1) × TPOT)
Throughput
throughput = total_output_tokens / wall_clock_seconds
Cost per token
cost_per_token = instance_cost_per_hour / (throughput × 3600)
Concurrency
concurrency = request_throughput × avg_latency [Little's Law: L = λW]
GPU utilization
MFU = achieved_FLOPS / peak_FLOPS [Model FLOP Utilization]
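The formulas above can be sketched as small helpers (a minimal sketch; function and variable names are assumptions, times in seconds). Note that TTFT already covers the first token, so only the remaining output_tokens − 1 decode steps add TPOT:

```python
def e2e_latency_s(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """End-to-end latency: TTFT covers the first token, TPOT each one after."""
    return ttft_s + (output_tokens - 1) * tpot_s

def cost_per_token(instance_cost_per_hour: float, tokens_per_s: float) -> float:
    """tokens_per_s × 3600 is tokens generated per hour."""
    return instance_cost_per_hour / (tokens_per_s * 3600)

def concurrency(requests_per_s: float, avg_latency_s: float) -> float:
    """Little's Law (L = λW): in-flight requests = arrival rate × avg latency."""
    return requests_per_s * avg_latency_s
```

For example, 0.5 s TTFT with 101 output tokens at 30 ms/token gives 0.5 + 100 × 0.03 = 3.5 s end-to-end.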
Latency Budget Targets
| Use Case | TTFT Target | TPOT Target |
|---|---|---|
| Interactive chat | < 500 ms | < 30 ms |
| Streaming assistant | < 1 s | < 50 ms |
| Batch summarization | < 10 s | < 100 ms |
| Offline processing | no target | maximize throughput |
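The interactive rows of the table above can be encoded as a simple budget check (a sketch; the dict layout and names are assumptions, targets in seconds):

```python
# Targets from the table; offline processing has no latency target.
BUDGETS = {
    "interactive_chat":    {"ttft_s": 0.5,  "tpot_s": 0.030},
    "streaming_assistant": {"ttft_s": 1.0,  "tpot_s": 0.050},
    "batch_summarization": {"ttft_s": 10.0, "tpot_s": 0.100},
}

def within_budget(use_case: str, measured_ttft_s: float, measured_tpot_s: float) -> bool:
    """True if both measured TTFT and TPOT meet the use case's targets."""
    b = BUDGETS[use_case]
    return measured_ttft_s <= b["ttft_s"] and measured_tpot_s <= b["tpot_s"]
```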
Key Definitions
- P50 / P95 / P99
- Percentile latencies. P99 = worst 1 in 100 requests. Use P99 for SLA commitments.
- SLO
- Service Level Objective — internal target (e.g., P99 TTFT < 1s).
- SLA
- Service Level Agreement — contractual commitment to customers.
- Cold start
- Latency penalty when a model is not loaded in GPU memory. Can be 10-60s.
- Warm pool
- Pre-loaded instances kept idle to avoid cold starts at the cost of idle GPU spend.
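Percentiles like P50/P95/P99 can be computed from measured latencies with a nearest-rank rule (a minimal sketch; the sample data is illustrative):

```python
import math

def percentile(values: list, p: float):
    """Nearest-rank percentile: P99 is the value only ~1 in 100 samples exceed."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 100 illustrative latency samples: 10 ms, 20 ms, ..., 1000 ms
latencies_ms = list(range(10, 1010, 10))
p50 = percentile(latencies_ms, 50)  # 500
p99 = percentile(latencies_ms, 99)  # 990
```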
Model Selection Checklist
- Define quality metric (MMLU, MT-Bench, task-specific eval)
- Set latency budget before picking model size
- Benchmark on YOUR data — leaderboards are proxies
- Check context length requirements
- Estimate VRAM (GB): params_in_billions × bytes_per_param + KV cache
- Consider fine-tuning smaller model vs. prompting larger one
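The VRAM estimate in the checklist is back-of-envelope arithmetic; a sketch follows (all shapes and defaults are illustrative assumptions, not any specific model's config):

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB: billions of params × bytes per param."""
    return params_billions * bytes_per_param

def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: one K and one V tensor (hence the ×2) per layer, per token."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1e9
```

For example, a 7B-parameter model in fp16 needs ~14 GB for weights alone, and a single 4096-token sequence with 32 layers and 32 KV heads of dim 128 in fp16 adds roughly another 2.1 GB of KV cache.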