Chapter 7

Production

Ship It

35 min read · 5 sections

By the end of this chapter you will be able to

  1. Calculate minimum replica count, GPU budget, and autoscaling thresholds from SLO requirements
  2. Design a zero-downtime deployment pipeline using blue-green or canary strategies
  3. Define the key observability metrics for an inference service and what to alert on
  4. Choose the right streaming protocol (SSE, WebSockets, gRPC) for your application's requirements

The final challenge: taking optimized model inference from development to production at scale.

Containerization

Containers turn a program into a packaged artifact that can run anywhere. For inference, Docker is the standard:

  • Container: An actively running environment that isolates an application
  • Image: An executable package with everything needed to run the software
  • Dockerfile: Instructions for creating an image
  • Registry: Central repository for managing and distributing images
💡 Start from Proven Images

Inference engines like vLLM and SGLang offer official base images. Start from these rather than building your own from scratch.

Dependency Management

Dependency chains for inference are long and fragile. Key practices:

  • Pin versions exactly — tools like uv, poetry, or pip will flag incompatibilities
  • Pack light — images are often many gigabytes; only include necessary dependencies
  • Use safetensors — the dominant format for serializing model weights safely

NIMs

NVIDIA Inference Microservices (NIMs) are pre-built Docker containers for popular open models. They come in two configurations: multi-LLM, which trades some performance for flexibility, and model-specific, which is tuned for maximum performance on a single model.

Autoscaling

The goal: always have enough resources to serve all requests while maintaining latency SLAs without wasting money on idle GPUs.

Autoscaling is typically built on Kubernetes, which handles container orchestration. Five configuration factors matter most:

Factor | Description
Min replicas | Minimum running replicas regardless of traffic
Max replicas | Maximum replicas during high traffic
Autoscaling window | Sliding timeframe for measuring traffic
Scale down delay | Wait time before scaling down
Concurrency target | Requests each replica handles at once
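As a rough illustration of how these factors interact, the sketch below derives a replica count from the concurrency target and clamps it to the min/max bounds. The function name and the idea of feeding it the average in-flight request count over the autoscaling window are assumptions for illustration, not the behavior of any particular autoscaler.

```python
import math


def desired_replicas(in_flight_requests: float,
                     concurrency_target: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Replicas needed so each one handles at most `concurrency_target` requests.

    `in_flight_requests` is assumed to be the average number of concurrent
    requests observed over the autoscaling window.
    """
    if concurrency_target <= 0:
        raise ValueError("concurrency_target must be positive")
    needed = math.ceil(in_flight_requests / concurrency_target)
    # Never go below the floor or above the ceiling.
    return max(min_replicas, min(max_replicas, needed))
```

With a concurrency target of 8 and 100 requests in flight, this asks for 13 replicas; with no traffic it stays at the configured minimum. The scale down delay would sit outside this function, deciding when a lower result is actually acted on.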

Concurrency and Batch Sizing

Three batching approaches in order of sophistication:

  1. Static batching: Wait until batch is full
  2. Dynamic batching: Wait until full OR timeout
  3. Continuous batching: Token-level interleaving (the standard for modern engines)

Batch sizing trades latency for throughput. Test multiple sizes to find the right fit.
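The "full OR timeout" rule of dynamic batching can be sketched with the standard library alone. This is a minimal sketch, not how any specific engine implements it; `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Wait for one request, then fill the batch until it is full or the timeout expires."""
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout: run with a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch
```

Static batching corresponds to an unbounded `max_wait_s`. Continuous batching is a different mechanism: it admits and retires requests between decoding steps inside the engine, so it is not captured by this sketch.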

Cold Starts

Cold start time = GPU procurement + image loading + model loading + engine startup. Optimizations:

  • Smaller images: Include only necessary dependencies
  • Quantized weights: Smaller files load faster
  • Cached engines: Pre-compile TensorRT-LLM engines and cache them
  • Local weight storage: Load from within the same datacenter, not remote S3
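To see why weight size and storage locality matter, the cold start formula above can be written out directly. This is a back-of-the-envelope sketch; the function names and the simple size-over-bandwidth model for weight loading are assumptions, and real load times also depend on disk, deserialization, and host-to-GPU transfer.

```python
def weight_load_seconds(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Approximate time to read model weights at a sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth_gb_per_s must be positive")
    return weights_gb / bandwidth_gb_per_s


def cold_start_seconds(gpu_procurement_s: float,
                       image_loading_s: float,
                       model_loading_s: float,
                       engine_startup_s: float) -> float:
    """Cold start time as the sum of its four phases."""
    return gpu_procurement_s + image_loading_s + model_loading_s + engine_startup_s
```

Under this model, a hypothetical 140 GB checkpoint read at 2 GB/s takes 70 seconds, while a 35 GB quantized version of the same model takes 17.5 seconds over the same link.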

Routing, Load Balancing, and Queueing

Routers answer "where should this request go?"; load balancers answer "where could this request go?" Intelligent routing uses:

  • KV cache-aware routing: Send to replicas with matching prefix
  • LoRA-aware routing: Send to replicas with desired fine-tune weights
  • Queue management: First-in-first-out with optional priority queuing
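KV cache-aware routing can be sketched as "pick the replica whose cached prefix overlaps most with the incoming prompt, and break ties by queue depth". The data structures here (a dict of cached token sequences per replica, a dict of queue depths) are hypothetical; production routers usually track prefixes with hashes or a radix tree rather than raw token lists.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt_tokens: list,
                 cached_prefixes: dict,
                 queue_depths: dict) -> str:
    """Prefer the longest cached prefix match; break ties with the shortest queue."""
    def score(replica: str) -> tuple:
        best_match = max(
            (shared_prefix_len(prompt_tokens, cached)
             for cached in cached_prefixes.get(replica, [])),
            default=0,
        )
        return (best_match, -queue_depths[replica])

    return max(queue_depths, key=score)
```

LoRA-aware routing follows the same shape: replace the prefix score with a check for whether the replica already has the requested adapter loaded.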

Scale to Zero

Scale down to zero replicas when idle, scale up when traffic arrives. Requires fast cold starts and robust queueing. Best for development and periodic batch workloads, not latency-sensitive production.

Multi-Cloud Capacity Management

High-volume deployments need thousands of GPUs distributed globally. True multi-cloud requires treating distinct pools as fungible compute:

  • Control plane: Global model deployment and scaling decisions
  • Workload planes: Direct inference traffic and in-cluster scaling

Benefits: capacity pooling, redundancy, lower latency from serving users in a nearby region (as a rule of thumb, about 5 ms of network latency per time zone crossed), and compliance.

GPU Procurement

Type | Description
Reserved | Hundreds/thousands of GPUs for months/years at discount
On-demand | Individual instances as needed at higher cost
Spot | Discounted instances that can be pre-empted

Large-scale inference uses a blend: reserved baseline + on-demand/spot for peaks.

Building for Reliability

GPUs fail. Meta's Llama 3 training saw 419 interruptions in 54 days across 16,000 GPUs — roughly one failure per 50,000 GPU-hours.
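The "one failure per 50,000 GPU-hours" figure follows from the numbers quoted above, and the same arithmetic gives a feel for what to expect from a smaller fleet. The 1,000-GPU fleet is a hypothetical example, and applying a training-run failure rate to an inference fleet is an assumption made only for illustration.

```python
# Figures quoted in the text above.
gpus = 16_000
days = 54
interruptions = 419

gpu_hours = gpus * days * 24                      # 20,736,000 GPU-hours
gpu_hours_per_interruption = gpu_hours / interruptions

# Hypothetical inference fleet, assuming the same failure rate applies.
fleet_gpus = 1_000
expected_failures_per_day = fleet_gpus * 24 / gpu_hours_per_interruption

print(f"{gpu_hours_per_interruption:,.0f} GPU-hours per interruption")
print(f"{expected_failures_per_day:.2f} expected failures per day")
```

At that rate, a 1,000-GPU fleet would see a failure roughly every two days, which is why detection and node replacement need to be automated rather than handled by hand.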

Two high-availability postures:

  • Active-active: Multiple regions serve traffic simultaneously; seamless failover
  • Active-passive: Hot standby cluster ready to take over

Key Takeaway

Inference engineers should expect hardware failure. Proactively detecting failures, cordoning bad nodes, and cycling pods keeps clusters healthy. Multi-cloud active-active postures provide the highest reliability.

Testing and Deployment

Zero-Downtime Deployment

Two strategies:

  • Blue-green: Two parallel environments; shift traffic between them
  • Canary: Start with small share of traffic to new deployment; gradually increase

Shadow traffic — mirroring real requests to a candidate deployment — validates performance without affecting users.
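A canary split is often implemented as a deterministic hash of some request or user identifier, so the same caller consistently lands on the same deployment while the percentage ramps up. The sketch below shows one way to do that; `route_to_canary` and the choice of SHA-256 with 10,000 buckets are illustrative assumptions, and in practice the split usually lives in the gateway or service mesh configuration.

```python
import hashlib


def route_to_canary(request_id: str, canary_percent: float) -> bool:
    """Return True if this request should go to the canary deployment.

    Hashing makes the decision stable for a given id, and raising
    `canary_percent` only ever adds ids to the canary set.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100
```

A blue-green cutover is the degenerate case: the percentage jumps from 0 to 100 in one step, with the old environment kept around for rollback.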

Cost Estimation

Cost per token depends on: GPU cost, utilization, throughput, and precision. Track cost metrics alongside latency metrics for informed optimization decisions.
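A first-order estimate combines these factors as GPU cost divided by the tokens actually served. This is a simplified sketch: the function name and the example prices are hypothetical, and precision enters only indirectly, through the throughput you measure at a given precision.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Dollars per one million generated tokens for a single replica.

    `tokens_per_second` is the measured throughput at full load, and
    `utilization` (0 to 1) is the fraction of time the replica is busy.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000
```

For example, a hypothetical $4/hour GPU serving 2,500 tokens per second at 60% utilization works out to about $0.74 per million tokens. Halving utilization doubles the cost, which is why idle replicas are the first thing to look at.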

Observability

Monitor: GPU utilization, VRAM usage, request latency (P50/P90/P99), queue depth, throughput, error rates, and model quality metrics. Alert on deviations from baselines.
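Latency percentiles and a baseline check can be sketched as follows. This uses the nearest-rank definition of a percentile and a hypothetical 1.5x tolerance; metrics systems typically compute percentiles from histograms instead, and the right alert threshold depends on your SLO.

```python
import math


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p99_regressed(samples_ms: list, baseline_p99_ms: float, tolerance: float = 1.5) -> bool:
    """True when the observed P99 exceeds the baseline by more than the tolerance factor."""
    return percentile(samples_ms, 99) > baseline_p99_ms * tolerance
```

Tracking P50, P90, and P99 together matters because a healthy median can hide a tail that is breaching the SLO.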

Client Code

Client Latency Overhead

Network latency adds to inference time. Minimize with: regional deployment, connection pooling, and keep-alive connections.

Asynchronous Inference

For non-real-time workloads, submit requests to a queue and poll for results. Decouples request submission from response processing.
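The submit-then-poll pattern can be shown with an in-process stand-in for the queue. This is only a sketch: `AsyncInferenceQueue` and `run_model` are hypothetical, and a real deployment would use a durable queue and a result store rather than a thread and a dict.

```python
import queue
import threading
import uuid


class AsyncInferenceQueue:
    """Toy job queue: submit() returns immediately, poll() checks for a result."""

    def __init__(self, run_model):
        self._run_model = run_model          # callable that performs inference
        self._pending = queue.Queue()
        self._results = {}
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, payload) -> str:
        """Enqueue a request and return a job id without waiting for inference."""
        job_id = uuid.uuid4().hex
        self._pending.put((job_id, payload))
        return job_id

    def poll(self, job_id: str):
        """Return the result if finished, otherwise None."""
        with self._lock:
            return self._results.get(job_id)

    def _worker(self):
        while True:
            job_id, payload = self._pending.get()
            result = self._run_model(payload)
            with self._lock:
                self._results[job_id] = result
```

The client holds only the job id, so it can disconnect and check back later. Using `None` to mean "not ready" is a simplification; a real API would return an explicit status field.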

Streaming and Protocol Support

Protocol | Best For
HTTP/REST | Simple request-response (most common)
SSE (Server-Sent Events) | LLM token streaming
WebSockets | Bidirectional streaming (voice, real-time)
gRPC | High-performance, schema-first APIs
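To make the SSE row concrete: an SSE response is a text stream of `data:` lines, with a blank line ending each event. The sketch below parses such a stream from any iterable of decoded lines. It handles only the `data` field and comments, and the function name is hypothetical; a production client would normally rely on an HTTP library's streaming support.

```python
def iter_sse_data(lines):
    """Yield the data payload of each complete event in an SSE text stream."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: the current event is complete.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            buffer.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event without a terminating blank line is incomplete and is dropped.
```

For token streaming, each yielded payload would typically be one chunk of generated text or a small JSON object, which the client appends to the output as it arrives.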

Key Takeaway

For LLM inference, SSE streaming is the standard. For voice applications (ASR/TTS), WebSockets enable the lowest-latency bidirectional streaming. Choose protocols based on your application's real-time requirements.
