
Production: Autoscaling & Deployment

4 sections · Quick reference card

Autoscaling Formulas

Minimum replicas

min_replicas = ceil(peak_rps × avg_latency_s / requests_per_replica)
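
This is Little's law: peak_rps × avg_latency_s gives the number of requests in flight at peak, which you divide by the concurrency one replica can sustain. A minimal sketch (the example numbers are illustrative):

```python
import math

def min_replicas(peak_rps: float, avg_latency_s: float, requests_per_replica: int) -> int:
    # Little's law: requests in flight at peak = arrival rate x average latency.
    concurrent = peak_rps * avg_latency_s
    # Divide by per-replica concurrency and round up.
    return math.ceil(concurrent / requests_per_replica)

# 120 rps at 2 s average latency, 8 concurrent requests per replica:
print(min_replicas(120, 2.0, 8))  # → 30
```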

GPU count for throughput

gpus = ceil(target_tokens_per_sec / throughput_per_gpu)
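
The same ceiling division in code, with illustrative numbers (measure throughput_per_gpu empirically for your model and hardware rather than trusting datasheet figures):

```python
import math

def gpus_needed(target_tokens_per_sec: float, throughput_per_gpu: float) -> int:
    # Round up: a fractional GPU's worth of demand still needs a whole GPU.
    return math.ceil(target_tokens_per_sec / throughput_per_gpu)

# 50k output tokens/s target, 1,800 tokens/s measured per GPU:
print(gpus_needed(50_000, 1_800))  # → 28
```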

Queue depth target

Scale up when queue_depth > scale_up_threshold (typically 2–5 pending requests)

Scale-down delay

Wait cooldown_period (60–300s) before removing replicas to avoid thrashing
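
The two rules above combine into an asymmetric policy: scale up immediately on queue pressure, scale down only after the cooldown has elapsed. A minimal sketch (class and threshold names are illustrative, not from any specific autoscaler):

```python
import time

SCALE_UP_THRESHOLD = 3   # pending requests (typical range 2-5)
COOLDOWN_S = 120         # scale-down delay (typical range 60-300 s)

class Autoscaler:
    def __init__(self):
        self.last_scale_down = 0.0

    def decide(self, queue_depth: int, replicas: int, min_replicas: int, now=None) -> int:
        now = time.monotonic() if now is None else now
        if queue_depth > SCALE_UP_THRESHOLD:
            return replicas + 1          # scale up immediately under queue pressure
        if queue_depth == 0 and replicas > min_replicas:
            if now - self.last_scale_down >= COOLDOWN_S:
                self.last_scale_down = now
                return replicas - 1      # scale down only after the cooldown
        return replicas                  # otherwise hold steady (avoids thrashing)
```

Scaling up by one replica per decision is deliberately conservative; aggressive step sizes trade thrash risk for faster burst absorption.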

Cold Start Mitigation Checklist

  • Keep minimum 1 warm replica per deployment to absorb bursts
  • Pre-cache container images on all nodes (pull at node startup, not at pod schedule time)
  • Store model weights on fast NVMe or in-memory (tmpfs) on node
  • Use P2P weight loading so GPUs read weights directly from NVMe, bypassing the CPU
  • Implement readiness probes that wait until model is loaded
  • Pre-warm KV cache with a dummy request after load
  • Use spot/preemptible with fallback to on-demand for cost + availability
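
The readiness-probe and pre-warm items can be sketched together: the probe returns 503 until loading and warm-up finish, so the orchestrator routes no traffic to a cold replica. A minimal stdlib sketch (the sleep and warm-up line stand in for real weight loading and a dummy generate call):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

model_ready = threading.Event()

def load_and_prewarm():
    time.sleep(0.1)        # stand-in for loading weights from NVMe/tmpfs
    _ = "warm-up request"  # stand-in for a dummy request that fills the KV cache
    model_ready.set()      # only now does /ready start returning 200

class Probe(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready":
            # 503 keeps the replica out of the load balancer until warm.
            self.send_response(200 if model_ready.is_set() else 503)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

threading.Thread(target=load_and_prewarm, daemon=True).start()
# HTTPServer(("0.0.0.0", 8080), Probe).serve_forever()  # run in production
```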

Batching Strategies

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Static batching | Wait until batch_size requests accumulate, then run | Offline, predictable workloads |
| Dynamic batching | Batch up to max_size or until a max_wait timeout | Moderate traffic, mixed lengths |
| Continuous batching | Insert new requests mid-generation at token boundaries | Production LLM serving, high concurrency |
| Chunked prefill | Split long prompts into chunks, interleave with decode | Mixed long/short context, reducing TTFT variance |
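
Dynamic batching is the easiest of these to sketch: take whatever has arrived once the batch is full or the wait deadline passes, whichever comes first. A minimal sketch (function name and defaults are illustrative):

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_size: int = 8, max_wait_s: float = 0.05) -> list:
    """Dynamic batching: return up to max_size requests, or whatever has
    arrived once max_wait_s has elapsed since the first request."""
    batch = [q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: run with a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained before the deadline
    return batch
```

The max_wait_s knob is the latency/throughput trade: longer waits build fuller batches but add up to max_wait_s of queueing delay to the first request.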

Observability Key Metrics

TTFT P99
Time to first token at 99th percentile. Primary SLO metric for interactive apps.
TPOT P50
Median time per output token. Tracks streaming smoothness.
GPU memory utilization
Target 80–90%. Below 70% = underutilized. Above 95% = risk of OOM.
Request queue depth
Pending requests waiting for a slot. Alert if > 10 for > 30s.
Token throughput
Output tokens/second per GPU. Compare to theoretical max to measure efficiency.
KV cache hit rate
Fraction of prefill tokens served from cache. Track for prompt caching systems.
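
TTFT P99 and TPOT P50 reduce to percentile queries over latency samples. A minimal nearest-rank sketch (the sample values are illustrative; production systems typically use histogram-based estimates instead of sorting raw samples):

```python
def percentile(samples, p):
    """Nearest-rank percentile: fine for dashboard-style summaries."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Illustrative per-request samples in milliseconds:
ttft_ms = [120, 95, 180, 110, 3400, 130, 105, 98, 140, 125]
tpot_ms = [18, 22, 19, 21, 20, 23, 18, 19, 22, 20]

print("TTFT P99:", percentile(ttft_ms, 99))  # tail latency, the SLO metric
print("TPOT P50:", percentile(tpot_ms, 50))  # median, tracks streaming smoothness
```

Note how the single 3400 ms outlier dominates P99 while leaving the median untouched; that is exactly why TTFT is tracked at the tail and TPOT at the median.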