Chapter 7

Production

Ship It

35 min read · 5 sections

The final challenge: taking optimized model inference from development to production at scale.

Containerization#

Containers turn a program into a packaged artifact that can run anywhere. For inference, Docker is the standard:

  • Container: A running, isolated instance of an image
  • Image: An executable package with everything needed to run the software
  • Dockerfile: Instructions for creating an image
  • Registry: Central repository for managing and distributing images
💡 Start from Proven Images

Inference engines like vLLM and SGLang offer official base images. Start from these rather than building your own from scratch.

Dependency Management

Dependency chains for inference are long and fragile. Key practices:

  • Pin versions exactly — tools like uv, poetry, or pip will flag incompatibilities
  • Pack light — images are often many gigabytes; only include necessary dependencies
  • Use safetensors — the dominant format for serializing model weights safely

NIMs

NVIDIA Inference Microservices are pre-built Docker containers for popular open models. Available as multi-LLM (flexible) or model-specific (maximum performance) configurations.

Autoscaling#

The goal: always have enough resources to serve all requests while maintaining latency SLAs without wasting money on idle GPUs.

Autoscaling uses Kubernetes for container orchestration. Five key configuration factors:

| Factor | Description |
| --- | --- |
| Min replicas | Minimum number of running replicas regardless of traffic |
| Max replicas | Maximum number of replicas during high traffic |
| Autoscaling window | Sliding timeframe for measuring traffic |
| Scale-down delay | Wait time before scaling down |
| Concurrency target | Number of requests each replica handles at once |
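A minimal sketch of how these knobs combine, assuming a Knative-style concurrency-based scaler (all numbers hypothetical). Real autoscalers also average in-flight requests over the autoscaling window and honor the scale-down delay before removing replicas:

```python
import math

def desired_replicas(in_flight_requests: int,
                     concurrency_target: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Scale so each replica handles at most `concurrency_target`
    concurrent requests, clamped to [min_replicas, max_replicas]."""
    want = math.ceil(in_flight_requests / concurrency_target)
    return max(min_replicas, min(max_replicas, want))

# 130 in-flight requests, each replica targets 8 concurrent requests
print(desired_replicas(130, 8, min_replicas=1, max_replicas=20))  # → 17
```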

Concurrency and Batch Sizing

Three batching approaches in order of sophistication:

  1. Static batching: Wait until batch is full
  2. Dynamic batching: Wait until full OR timeout
  3. Continuous batching: Token-level interleaving (the standard for modern engines)

Batch sizing trades latency for throughput. Test multiple sizes to find the right fit.
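The step from static to dynamic batching is the timeout: return a partial batch rather than stall a request indefinitely. A minimal dynamic-batching sketch (queue contents and timeout values hypothetical):

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch: int, timeout_s: float) -> list:
    """Dynamic batching: return when the batch is full OR the timeout expires."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(3):
    q.put(f"req-{i}")
print(collect_batch(q, max_batch=8, timeout_s=0.05))  # → ['req-0', 'req-1', 'req-2']
```

Continuous batching goes further by interleaving at the token level inside the engine's scheduling loop, so new requests join a batch mid-generation instead of waiting for the batch to drain.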

Cold Starts

Cold start time = GPU procurement + image loading + model loading + engine startup. Optimizations:

  • Smaller images: Include only necessary dependencies
  • Quantized weights: Smaller files load faster
  • Cached engines: Pre-compile TensorRT-LLM engines and cache them
  • Local weight storage: Load from within the same datacenter, not remote S3
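A back-of-the-envelope sketch of why smaller and quantized weights matter for the model-loading term (model size and bandwidth figures are hypothetical):

```python
def load_time_s(param_count: float, bytes_per_param: float, bandwidth_gbps: float) -> float:
    """Seconds to stream model weights at a given effective bandwidth."""
    total_bytes = param_count * bytes_per_param
    return total_bytes / (bandwidth_gbps * 1e9 / 8)  # Gbit/s → bytes/s

# 70B-parameter model over a 25 Gbit/s link to remote object storage
print(round(load_time_s(70e9, 2.0, 25)))  # FP16 → 45 (seconds)
print(round(load_time_s(70e9, 0.5, 25)))  # INT4 → 11 (seconds)
```

The same arithmetic explains the last bullet: weights served from in-datacenter storage see far higher effective bandwidth than a remote S3 bucket.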

Routing, Load Balancing, and Queueing

Routers decide "where should this request go?"; load balancers decide "where could this request go?" Intelligent routing uses:

  • KV cache-aware routing: Send to replicas with matching prefix
  • LoRA-aware routing: Send to replicas with desired fine-tune weights
  • Queue management: First-in-first-out with optional priority queuing
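A toy sketch of KV cache-aware routing, assuming each replica advertises its longest cached prompt prefix (replica names and prefixes are invented for illustration):

```python
def route(prompt: str, replica_prefixes: dict) -> str:
    """Pick the replica whose cached prefix shares the longest
    common prefix with the incoming prompt."""
    def shared(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    return max(replica_prefixes, key=lambda r: shared(prompt, replica_prefixes[r]))

replicas = {
    "replica-a": "You are a helpful assistant.",
    "replica-b": "Translate the following text",
}
print(route("Translate the following text to French: bonjour", replicas))  # → replica-b
```

Sending the request to `replica-b` lets the engine reuse the KV cache for the shared prefix instead of recomputing it.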

Scale to Zero

Scale down to zero replicas when idle, scale up when traffic arrives. Requires fast cold starts and robust queueing. Best for development and periodic batch workloads, not latency-sensitive production.

Multi-Cloud Capacity Management#

High-volume deployments need thousands of GPUs distributed globally. True multi-cloud requires treating distinct pools as fungible compute:

  • Control plane: Global model deployment and scaling decisions
  • Workload planes: Direct inference traffic and in-cluster scaling

Benefits: capacity pooling, redundancy, compliance, and lower latency from serving users in nearby regions (each time zone of distance adds roughly 5 ms).

GPU Procurement

| Type | Description |
| --- | --- |
| Reserved | Hundreds or thousands of GPUs committed for months or years at a discount |
| On-demand | Individual instances provisioned as needed at higher cost |
| Spot | Discounted instances that can be pre-empted at any time |

Large-scale inference uses a blend: reserved baseline + on-demand/spot for peaks.
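A hypothetical cost comparison of a blended fleet versus pure on-demand (all GPU counts and hourly rates invented for illustration):

```python
def monthly_cost(peak_gpus: int, baseline_gpus: int,
                 reserved_hr: float, on_demand_hr: float,
                 peak_hours: int, total_hours: int = 730) -> float:
    """Blend: reserved GPUs run 24/7 at a discount; on-demand covers peaks."""
    reserved = baseline_gpus * reserved_hr * total_hours
    burst = (peak_gpus - baseline_gpus) * on_demand_hr * peak_hours
    return reserved + burst

# Hypothetical rates: $2.00/hr reserved vs $4.50/hr on-demand
blended = monthly_cost(peak_gpus=100, baseline_gpus=60,
                       reserved_hr=2.00, on_demand_hr=4.50, peak_hours=200)
all_on_demand = 100 * 4.50 * 730
print(blended, all_on_demand)  # → 123600.0 328500.0
```

The blend wins whenever the baseline load is steady enough to keep reserved GPUs busy; spot instances can replace some of the on-demand burst at further discount if the workload tolerates pre-emption.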

Building for Reliability

GPUs fail. Meta's Llama 3 training saw 419 interruptions in 54 days across 16,000 GPUs — roughly one failure per 50,000 GPU-hours.
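The failure-rate figure follows directly from the reported numbers:

```python
gpus = 16_000
days = 54
interruptions = 419

gpu_hours = gpus * days * 24              # 20,736,000 GPU-hours
hours_per_failure = gpu_hours / interruptions
print(round(hours_per_failure))           # → 49489, i.e. ~one failure per 50,000 GPU-hours
```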

Two high-availability postures:

  • Active-active: Multiple regions serve traffic simultaneously; seamless failover
  • Active-passive: Hot standby cluster ready to take over

Key Takeaway

Inference engineers should expect hardware failure. Proactively noting failures, cordoning nodes, and cycling pods keeps clusters healthy. Multi-cloud active-active postures provide the highest reliability.
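A toy reconciliation loop illustrating the cordon-on-failure pattern described above. The node structure and error fields are invented stand-ins for real GPU health signals (e.g. ECC error counts, NVIDIA Xid events):

```python
def reconcile(nodes: dict, max_ecc_errors: int = 3) -> dict:
    """Cordon nodes whose GPUs report too many errors, so the
    scheduler stops placing new pods on them while they drain."""
    for name, health in nodes.items():
        if health["ecc_errors"] > max_ecc_errors or not health["xid_clean"]:
            health["schedulable"] = False  # cordoned
    return nodes

cluster = {
    "node-1": {"ecc_errors": 0, "xid_clean": True,  "schedulable": True},
    "node-2": {"ecc_errors": 7, "xid_clean": False, "schedulable": True},
}
print(reconcile(cluster)["node-2"]["schedulable"])  # → False
```

In a real cluster the same decision is enacted with `kubectl cordon` plus pod eviction rather than an in-memory flag.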

Testing and Deployment#

Zero-Downtime Deployment

Two strategies:

  • Blue-green: Two parallel environments; shift traffic between them
  • Canary: Start with small share of traffic to new deployment; gradually increase
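A sketch of a deterministic canary split: hashing each request id keeps a given client consistently on the same deployment while the canary fraction is gradually raised (the function and id scheme are hypothetical):

```python
import hashlib

def choose_deployment(request_id: str, canary_fraction: float) -> str:
    """Hash the request id into [0, 1) and send that fraction
    of traffic to the canary deployment."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000
    return "canary" if bucket < canary_fraction else "stable"

sent = [choose_deployment(f"req-{i}", 0.05) for i in range(10_000)]
print(sent.count("canary"))  # roughly 5% of requests
```

Raising `canary_fraction` toward 1.0 completes the rollout; the same mechanism, with two full environments behind the split, implements a blue-green cutover.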

Shadow traffic — mirroring real requests to a candidate deployment — validates performance without affecting users.

Cost Estimation

Cost per token depends on: GPU cost, utilization, throughput, and precision. Track cost metrics alongside latency metrics for informed optimization decisions.
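A sketch of the cost-per-token arithmetic with hypothetical numbers:

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_sec: float,
                            utilization: float) -> float:
    """Dollars per 1M generated tokens for one GPU at a given
    effective throughput and average utilization."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1e6

# Hypothetical: $4.50/hr GPU, 2,500 tok/s, 70% average utilization
print(round(cost_per_million_tokens(4.50, 2500, 0.7), 3))  # → 0.714
```

The formula makes the optimization levers explicit: cheaper GPUs, higher utilization, and higher throughput (e.g. via lower precision) each reduce cost per token directly.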

Observability

Monitor: GPU utilization, VRAM usage, request latency (P50/P90/P99), queue depth, throughput, error rates, and model quality metrics. Alert on deviations from baselines.

Client Code#

Client Latency Overhead

Network latency adds to inference time. Minimize with: regional deployment, connection pooling, and keep-alive connections.

Asynchronous Inference

For non-real-time workloads, submit requests to a queue and poll for results. Decouples request submission from response processing.
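The submit-and-poll pattern can be sketched in-process, with a thread standing in for the inference backend (the job-store API here is invented for illustration):

```python
import threading
import time
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...}

def submit(prompt: str) -> str:
    """Enqueue a request and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    def work():
        jobs[job_id] = {"status": "done", "result": f"completion for: {prompt}"}
    threading.Thread(target=work).start()
    return job_id

def poll(job_id: str) -> dict:
    return jobs[job_id]

jid = submit("hello")
while poll(jid)["status"] != "done":  # client polls until the result is ready
    time.sleep(0.01)
print(poll(jid)["result"])  # → completion for: hello
```

In production the job store would be durable (a database or object store) and polling is often replaced with a webhook callback.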

Streaming and Protocol Support

| Protocol | Best For |
| --- | --- |
| HTTP/REST | Simple request-response (most common) |
| SSE (Server-Sent Events) | LLM token streaming |
| WebSockets | Bidirectional streaming (voice, real-time) |
| gRPC | High-performance, schema-first APIs |
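An SSE stream is plain text over HTTP, so a minimal client just parses `data:` lines; this sketch assumes the OpenAI-style `[DONE]` end-of-stream sentinel:

```python
def parse_sse(lines):
    """Yield the payload of each `data:` line from an SSE stream,
    stopping at the [DONE] sentinel."""
    for line in lines:
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                return
            yield payload

# Simulated token stream as it would arrive over the wire
stream = ["data: Hello", "data:  world", "", "data: [DONE]"]
print(list(parse_sse(stream)))  # → ['Hello', ' world']
```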

Key Takeaway

For LLM inference, SSE streaming is the standard. For voice applications (ASR/TTS), WebSockets enable the lowest-latency bidirectional streaming. Choose protocols based on your application's real-time requirements.

Check Your Understanding


Which batching strategy do modern inference engines use?