The final challenge: taking optimized model inference from development to production at scale.
Containerization
Containers turn a program into a packaged artifact that can run anywhere. For inference, Docker is the standard:
- Container: An actively running environment that isolates an application
- Image: An executable package with everything needed to run the software
- Dockerfile: Instructions for creating an image
- Registry: Central repository for managing and distributing images
Inference engines like vLLM and SGLang offer official base images. Start from these rather than building your own from scratch.
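As a sketch, a Dockerfile extending an official engine image might look like the following (the image tag, model name, and port are illustrative, not prescriptive — pin the exact versions you have validated):

```dockerfile
# Start from an official engine image rather than building from scratch.
FROM vllm/vllm-openai:v0.6.0

# Bake extra, exactly-pinned dependencies into the image (keep this list minimal).
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# The base image's entrypoint is the OpenAI-compatible server;
# CMD supplies its arguments, including the model to serve.
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
```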
Dependency Management
Dependency chains for inference are long and fragile. Key practices:
- Pin versions exactly: tools like `uv`, `poetry`, or `pip` will flag incompatibilities
- Pack light: images are often many gigabytes; include only necessary dependencies
- Use safetensors: the dominant format for serializing model weights safely
NIMs
NVIDIA Inference Microservices are pre-built Docker containers for popular open models. Available as multi-LLM (flexible) or model-specific (maximum performance) configurations.
Autoscaling
The goal: always have enough resources to serve all requests while maintaining latency SLAs without wasting money on idle GPUs.
Autoscaling uses Kubernetes for container orchestration. Five key configuration factors:
| Factor | Description |
|---|---|
| Min replicas | Minimum running replicas regardless of traffic |
| Max replicas | Maximum replicas during high traffic |
| Autoscaling window | Sliding timeframe for measuring traffic |
| Scale down delay | Wait time before scaling down |
| Concurrency target | Requests each replica handles at once |
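The concurrency target drives the scaling decision: divide in-flight requests by the per-replica target and clamp the result to the min/max bounds. A minimal sketch (function and parameter names are ours, not any particular autoscaler's API):

```python
import math

def desired_replicas(in_flight_requests: int,
                     concurrency_target: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Replicas needed so each handles at most `concurrency_target` requests."""
    needed = math.ceil(in_flight_requests / concurrency_target)
    return max(min_replicas, min(max_replicas, needed))

# 130 concurrent requests with a per-replica target of 32 -> ceil(130/32) = 5
print(desired_replicas(130, 32, min_replicas=1, max_replicas=8))  # 5
```

In production the numerator would be a smoothed measurement over the autoscaling window, not an instantaneous count, and the scale-down delay would gate any decrease.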
Concurrency and Batch Sizing
Three batching approaches in order of sophistication:
- Static batching: Wait until batch is full
- Dynamic batching: Wait until full OR timeout
- Continuous batching: Token-level interleaving (the standard for modern engines)
Batch sizing trades latency for throughput. Test multiple sizes to find the right fit.
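Dynamic batching is easy to sketch: collect requests until the batch is full or a deadline passes, whichever comes first. A toy illustration, not any engine's actual scheduler:

```python
import queue
import time

def next_batch(q: "queue.Queue[str]", max_batch: int, timeout_s: float) -> list:
    """Dynamic batching: return when the batch is full OR the timeout expires."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(3):
    q.put(f"req-{i}")
print(next_batch(q, max_batch=8, timeout_s=0.05))
# ['req-0', 'req-1', 'req-2'] -- partial batch shipped after the 50 ms timeout
```

Continuous batching goes further: instead of waiting at request granularity, the scheduler admits and retires sequences at every token step.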
Cold Starts
Cold start time = GPU procurement + image loading + model loading + engine startup. Optimizations:
- Smaller images: Include only necessary dependencies
- Quantized weights: Smaller files load faster
- Cached engines: Pre-compile TensorRT-LLM engines and cache them
- Local weight storage: Load from within the same datacenter, not remote S3
Routing, Load Balancing, and Queueing
Routers decide where a request should go; load balancers decide where it could go. Intelligent routing uses:
- KV cache-aware routing: Send to replicas with matching prefix
- LoRA-aware routing: Send to replicas with desired fine-tune weights
- Queue management: First-in-first-out with optional priority queuing
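KV cache-aware routing can be sketched as longest-shared-prefix matching between the incoming prompt and each replica's cached tokens. This is a simplified model (real routers track cache state approximately and weigh it against load):

```python
def route(prompt_tokens: list, replicas: dict) -> str:
    """Pick the replica whose cached tokens share the longest prefix with the prompt."""
    def shared_prefix_len(cached: list) -> int:
        n = 0
        for a, b in zip(cached, prompt_tokens):
            if a != b:
                break
            n += 1
        return n
    return max(replicas, key=lambda name: shared_prefix_len(replicas[name]))

replicas = {
    "replica-a": ["sys", "you", "are"],  # has the system prompt cached
    "replica-b": ["sys", "other"],
    "replica-c": [],                     # cold cache
}
print(route(["sys", "you", "are", "hi"], replicas))  # replica-a
```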
Scale to Zero
Scale down to zero replicas when idle, scale up when traffic arrives. Requires fast cold starts and robust queueing. Best for development and periodic batch workloads, not latency-sensitive production.
Multi-Cloud Capacity Management
High-volume deployments need thousands of GPUs distributed globally. True multi-cloud means treating GPU pools from different providers and regions as fungible compute:
- Control plane: Global model deployment and scaling decisions
- Workload planes: Direct inference traffic and in-cluster scaling
Benefits: capacity pooling, redundancy, lower latency (on the order of 5 ms saved per time zone of proximity to users), and compliance.
GPU Procurement
| Type | Description |
|---|---|
| Reserved | Hundreds/thousands of GPUs for months/years at discount |
| On-demand | Individual instances as needed at higher cost |
| Spot | Discounted instances that can be pre-empted |
Large-scale inference uses a blend: reserved baseline + on-demand/spot for peaks.
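The blend is straightforward to cost out: reserved capacity covers the baseline, and on-demand (or spot) covers the burst above it. The rates below are made up for illustration:

```python
def blended_hourly_cost(baseline_gpus: int, peak_gpus: int,
                        reserved_rate: float, on_demand_rate: float) -> float:
    """Reserved covers the baseline; on-demand covers whatever the peak adds."""
    burst = max(0, peak_gpus - baseline_gpus)
    return baseline_gpus * reserved_rate + burst * on_demand_rate

# 100 reserved GPUs at $2/h, plus 40 on-demand GPUs at $4/h during a peak
print(blended_hourly_cost(100, 140, 2.0, 4.0))  # 360.0
```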
Building for Reliability
GPUs fail. Meta's Llama 3 training saw 419 interruptions in 54 days across 16,000 GPUs — roughly one failure per 50,000 GPU-hours.
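The quoted failure rate follows directly from those numbers:

```python
gpus = 16_000
days = 54
interruptions = 419

gpu_hours = gpus * days * 24                 # 20,736,000 total GPU-hours
hours_per_failure = gpu_hours / interruptions
print(round(hours_per_failure))              # 49489: roughly one failure per 50,000 GPU-hours
```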
Two high-availability postures:
- Active-active: Multiple regions serve traffic simultaneously; seamless failover
- Active-passive: Hot standby cluster ready to take over
Key Takeaway
Inference engineers should expect hardware failure. Proactively noting failures, cordoning nodes, and cycling pods keeps clusters healthy. Multi-cloud active-active postures provide the highest reliability.
Testing and Deployment
Zero-Downtime Deployment
Two strategies:
- Blue-green: Two parallel environments; shift traffic between them
- Canary: Start with small share of traffic to new deployment; gradually increase
Shadow traffic — mirroring real requests to a candidate deployment — validates performance without affecting users.
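A canary rollout reduces to a weighted traffic split whose canary share is ratcheted up as confidence grows. A toy sketch (names and shares are illustrative):

```python
import random

def pick_deployment(canary_share: float) -> str:
    """Route a request to 'canary' with probability canary_share, else 'stable'."""
    return "canary" if random.random() < canary_share else "stable"

random.seed(0)
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_deployment(0.05)] += 1
print(counts)  # roughly 5% of traffic hits the canary
```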
Cost Estimation
Cost per token depends on: GPU cost, utilization, throughput, and precision. Track cost metrics alongside latency metrics for informed optimization decisions.
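As a back-of-the-envelope sketch, cost per token follows directly from those inputs; the rates below are made up for illustration (precision enters indirectly, by raising the achievable throughput):

```python
def cost_per_million_tokens(gpu_hourly_cost: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """USD per 1M generated tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_cost / tokens_per_hour * 1_000_000

# A $4/h GPU sustaining 2,500 tok/s at 70% average utilization
print(round(cost_per_million_tokens(4.0, 2500, 0.7), 3))  # 0.635
```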
Observability
Monitor: GPU utilization, VRAM usage, request latency (P50/P90/P99), queue depth, throughput, error rates, and model quality metrics. Alert on deviations from baselines.
Client Code
Client Latency Overhead
Network latency adds to inference time. Minimize with: regional deployment, connection pooling, and keep-alive connections.
Asynchronous Inference
For non-real-time workloads, submit requests to a queue and poll for results. Decouples request submission from response processing.
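The pattern can be sketched with an in-memory job store standing in for a real queue and result backend (all names here are illustrative):

```python
import time
import uuid

jobs = {}  # stands in for a durable queue/result store

def submit(prompt: str) -> str:
    """Enqueue a request and return immediately with a job ID."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "prompt": prompt, "result": None}
    return job_id

def worker_step() -> None:
    """One pass of a backend worker completing pending jobs."""
    for job in jobs.values():
        if job["status"] == "pending":
            job["status"] = "done"
            job["result"] = f"response to: {job['prompt']}"

def poll(job_id: str, interval_s: float = 0.01) -> str:
    """Client-side polling loop, decoupled from submission."""
    while jobs[job_id]["status"] != "done":
        time.sleep(interval_s)
        worker_step()  # in practice the worker runs in a separate process
    return jobs[job_id]["result"]

job = submit("summarize this document")
print(poll(job))  # response to: summarize this document
```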
Streaming and Protocol Support
| Protocol | Best For |
|---|---|
| HTTP/REST | Simple request-response (most common) |
| SSE (Server-Sent Events) | LLM token streaming |
| WebSockets | Bidirectional streaming (voice, real-time) |
| gRPC | High-performance, schema-first APIs |
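An SSE stream is just lines of `data: <payload>` separated by blank lines; OpenAI-compatible APIs terminate the stream with a `data: [DONE]` sentinel. A minimal parser sketch (field handling is simplified relative to the full spec):

```python
def parse_sse(raw: str):
    """Yield the payload of each `data:` line, stopping at the [DONE] sentinel."""
    for line in raw.splitlines():
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return
            yield payload

stream = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
print(list(parse_sse(stream)))  # ['Hello', 'world']
```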
Key Takeaway
For LLM inference, SSE streaming is the standard. For voice applications (ASR/TTS), WebSockets enable the lowest-latency bidirectional streaming. Choose protocols based on your application's real-time requirements.