Production: Autoscaling & Deployment
4 sections · Quick reference card
Autoscaling Formulas
Minimum replicas
min_replicas = ceil(peak_rps × avg_latency_s / concurrent_requests_per_replica)
(Little's law: peak_rps × avg_latency_s is the expected number of in-flight requests; divide by the concurrency one replica can sustain.)
GPU count for throughput
gpus = ceil(target_tokens_per_sec / throughput_per_gpu)
Queue depth target
Scale up when queue_depth > scale_up_threshold (typically 2–5 pending requests)
Scale-down delay
Wait for cooldown_period (60–300s) after the last scaling event before removing replicas, to avoid thrashing
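The formulas above can be sketched as a minimal sizing-and-decision module. This is an illustrative assumption, not a production autoscaler: `concurrent_requests_per_replica` names the per-replica concurrency the first formula divides by, and the `ScaleController` bookkeeping is one simple way to combine the queue-depth trigger with the cooldown.

```python
import math
import time
from typing import Optional


def min_replicas(peak_rps: float, avg_latency_s: float,
                 concurrent_requests_per_replica: int) -> int:
    # Little's law: in-flight requests ≈ arrival rate × latency
    return math.ceil(peak_rps * avg_latency_s / concurrent_requests_per_replica)


def gpus_needed(target_tokens_per_sec: float, throughput_per_gpu: float) -> int:
    # Throughput-based GPU count, rounded up to whole devices
    return math.ceil(target_tokens_per_sec / throughput_per_gpu)


class ScaleController:
    """Queue-depth scaler with a cooldown to avoid thrashing."""

    def __init__(self, scale_up_threshold: int = 3, cooldown_s: float = 120.0):
        self.scale_up_threshold = scale_up_threshold
        self.cooldown_s = cooldown_s
        self._last_scale_ts = 0.0

    def decide(self, queue_depth: int, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        if queue_depth > self.scale_up_threshold:
            self._last_scale_ts = now
            return "scale_up"
        # Only shrink when idle and the cooldown has fully elapsed
        if queue_depth == 0 and now - self._last_scale_ts >= self.cooldown_s:
            self._last_scale_ts = now
            return "scale_down"
        return "hold"
```

For example, 100 req/s at 2 s average latency with 8 concurrent requests per replica sizes to `ceil(200 / 8) = 25` replicas.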
Cold Start Mitigation Checklist
- Keep at least one warm replica per deployment to absorb bursts
- Pre-cache container images on all nodes (pre-pull rather than pulling at pod startup)
- Store model weights on fast local NVMe or in memory (tmpfs) on the node
- Use P2P weight loading (e.g., GPUDirect Storage) so GPUs read weights directly from NVMe, bypassing host memory
- Implement readiness probes that report unready until the model is fully loaded
- Pre-warm the KV cache with a dummy request after the model loads
- Use spot/preemptible instances with fallback to on-demand for cost savings without losing availability
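The readiness-probe and pre-warm items can be sketched with only the standard library. This is a minimal illustration, not a real serving stack: `load_weights` and the `generate` call are hypothetical placeholders for your model loader and a dummy warm-up request, and the handler stands in for whatever endpoint your orchestrator probes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_READY = threading.Event()


class ReadinessHandler(BaseHTTPRequestHandler):
    """Returns 503 until the model is loaded, so the orchestrator
    withholds traffic from this replica during cold start."""

    def do_GET(self):
        if self.path == "/ready" and MODEL_READY.is_set():
            self.send_response(200)
        else:
            self.send_response(503)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging in this sketch


def load_model_and_warm():
    model = load_weights()                   # hypothetical: e.g. mmap weights from NVMe
    model.generate("warmup", max_tokens=1)   # hypothetical dummy request pre-fills KV cache
    MODEL_READY.set()                        # only now does /ready return 200
```

The key ordering: the event is set only after both the load and the warm-up request complete, so the first real user request never pays the cold-start cost.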
Batching Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Static batching | Wait until batch_size requests, then run | Offline, predictable workloads |
| Dynamic batching | Batch up to max_size or max_wait timeout | Moderate traffic, mixed lengths |
| Continuous batching | Insert new requests mid-generation at token boundaries | Production LLM serving, high concurrency |
| Chunked prefill | Split long prompts into chunks, interleave with decode | Mixed long/short context, reducing TTFT variance |
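The dynamic-batching row ("batch up to max_size or max_wait timeout") can be sketched in a few lines. A minimal sketch, assuming requests arrive on a `queue.Queue`; the function name and defaults are illustrative.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_size: int = 8,
                  max_wait_s: float = 0.05) -> list:
    """Dynamic batching: return up to max_size requests, waiting at
    most max_wait_s after the first request arrives."""
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout: run a partial batch rather than keep waiting
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Under light traffic the timeout bounds added latency at `max_wait_s`; under heavy traffic the batch fills to `max_size` immediately. Continuous batching replaces this request-level loop with per-token admission inside the model's generation loop.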
Observability Key Metrics
- TTFT P99
- Time to first token at 99th percentile. Primary SLO metric for interactive apps.
- TPOT P50
- Median time per output token. Tracks streaming smoothness.
- GPU memory utilization
- Target 80–90%. Below 70% = underutilized. Above 95% = risk of OOM.
- Request queue depth
- Pending requests waiting for a slot. Alert if > 10 for > 30s.
- Token throughput
- Output tokens/second per GPU. Compare to theoretical max to measure efficiency.
- KV cache hit rate
- Fraction of prefill tokens served from cache. Track for prompt caching systems.
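The latency metrics above can be derived from per-request timings. A minimal sketch: the nearest-rank percentile method and the helper names are assumptions, chosen to keep the example dependency-free.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]); adequate for
    dashboard-style SLO checks like TTFT P99."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]


def tpot_s(total_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token: decode time spread over the tokens
    after the first (the first token is prefill-dominated)."""
    return (total_latency_s - ttft_s) / (output_tokens - 1)
```

For example, a request that finishes in 2.0 s with a 0.5 s TTFT and 11 output tokens has a TPOT of `1.5 / 10 = 0.15` s/token; feed per-request TPOTs into `percentile(..., 50)` for the P50 shown above.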