
Prerequisites: Latency, Throughput & Budgets

4 sections · Quick reference card

Core Formulas

End-to-end latency

latency = TTFT + (output_tokens × TPOT) [TTFT = time to first token; TPOT = time per output token]

Throughput

throughput = total_output_tokens / wall_clock_seconds

Cost per token

cost_per_token = instance_cost_per_hour / (throughput × 3600)

Concurrency

concurrency = throughput × avg_latency [Little's Law]

GPU utilization

MFU = achieved_FLOPS / peak_FLOPS [Model FLOP Utilization]
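The formulas above can be sketched in Python (a sketch; the function names and units are illustrative assumptions: latencies in seconds, throughput in tokens per second):

```python
def end_to_end_latency(ttft_s: float, output_tokens: int, tpot_s: float) -> float:
    """latency = TTFT + (output_tokens × TPOT), all in seconds."""
    return ttft_s + output_tokens * tpot_s

def cost_per_token(instance_cost_per_hour: float, throughput_tok_s: float) -> float:
    """Hourly instance cost divided by tokens produced per hour."""
    return instance_cost_per_hour / (throughput_tok_s * 3600)

def concurrency(request_throughput_rps: float, avg_latency_s: float) -> float:
    """Little's Law: average in-flight requests = arrival rate × average latency."""
    return request_throughput_rps * avg_latency_s

# e.g. 400 ms TTFT plus 200 output tokens at 30 ms/token ≈ 6.4 s end to end
```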

Latency Budget Targets

Use Case            | TTFT Target | TPOT Target
Interactive chat    | < 500 ms    | < 30 ms
Streaming assistant | < 1 s       | < 50 ms
Batch summarization | < 10 s      | < 100 ms
Offline processing  | no target   | maximize throughput
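These targets can be encoded as a small lookup for automated budget checks (a sketch; the dict keys and helper name are assumptions, all times in seconds):

```python
# (max TTFT, max TPOT) in seconds, per use case; offline processing has no target
BUDGETS = {
    "interactive_chat": (0.5, 0.030),
    "streaming_assistant": (1.0, 0.050),
    "batch_summarization": (10.0, 0.100),
}

def within_budget(use_case: str, ttft_s: float, tpot_s: float) -> bool:
    """True if a measured request meets both the TTFT and TPOT targets."""
    ttft_max, tpot_max = BUDGETS[use_case]
    return ttft_s < ttft_max and tpot_s < tpot_max
```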

Key Definitions

P50 / P95 / P99
Percentile latencies: P99 is the latency that 99% of requests stay under, i.e., the worst 1 in 100 requests. Use P99 for SLA commitments.
SLO
Service Level Objective — internal target (e.g., P99 TTFT < 1s).
SLA
Service Level Agreement — contractual commitment to customers.
Cold start
Latency penalty when a model is not loaded in GPU memory. Can be 10-60s.
Warm pool
Pre-loaded instances kept idle to avoid cold starts at the cost of idle GPU spend.
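Percentiles like P50/P95/P99 can be computed from a latency sample like so (a sketch using the nearest-rank convention; other interpolation conventions exist and differ slightly on small samples):

```python
import math

def percentile(latencies: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value >= p% of the sample."""
    s = sorted(latencies)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(0, k)]

# Over 100 request latencies, percentile(samples, 99) is the
# 99th-smallest value: only 1 request in 100 is slower.
```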

Model Selection Checklist

  • Define quality metric (MMLU, MT-Bench, task-specific eval)
  • Set latency budget before picking model size
  • Benchmark on YOUR data — leaderboards are proxies
  • Check context length requirements
  • Estimate VRAM (GB): model_params_B × bytes_per_param + KV cache
  • Consider fine-tuning smaller model vs. prompting larger one
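The VRAM line item above can be turned into a back-of-envelope helper (a sketch; it ignores activation memory, fragmentation, and framework overhead, which add real headroom on top):

```python
def vram_estimate_gb(params_billions: float, bytes_per_param: float,
                     kv_cache_gb: float = 0.0) -> float:
    """model_params_B × bytes_per_param + KV cache, result in GB."""
    return params_billions * bytes_per_param + kv_cache_gb

# e.g. a 7B model in fp16 (2 bytes/param): 7 × 2 = 14 GB before KV cache
```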