The final challenge: taking optimized model inference from development to production at scale.

Containerization#

Containers turn a program into a packaged artifact that can run anywhere. For inference, Docker is the standard:

Container: An actively running environment that isolates an application
Image: An executable package with everything needed to run the software
Dockerfile: Instructions for creating an image
Registry: Central repository for managing and distributing images

💡 Start from Proven Images

Inference engines like vLLM and SGLang offer official base images. Start from these rather than building your own from scratch.

Dependency Management

Dependency chains for inference are long and fragile. Key practices:

Pin versions exactly — tools like uv, poetry, or pip will flag incompatibilities
Pack light — images are often many gigabytes; only include necessary dependencies
Use safetensors — the dominant format for serializing model weights safely
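
As a small illustration of the first practice, the sketch below scans requirements-style lines and reports any that lack an exact `==` pin. The helper name `find_unpinned` and the sample lines are hypothetical; in practice a lock file produced by a tool such as uv or poetry would usually do this job.

```python
def find_unpinned(requirement_lines):
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Illustrative input only; the packages and versions are placeholders.
sample = [
    "torch==2.4.0",
    "safetensors>=0.4",  # a range, not an exact pin
    "numpy",             # no version at all
    "# a comment line",
]
print(find_unpinned(sample))  # ['safetensors>=0.4', 'numpy']
```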

NIMs

NVIDIA Inference Microservices (NIMs) are pre-built Docker containers for popular open models. They are available in multi-LLM (flexible) or model-specific (maximum performance) configurations.

Autoscaling#

The goal: always have enough resources to serve all requests while maintaining latency SLAs without wasting money on idle GPUs.

Autoscaling is typically built on Kubernetes, which handles container orchestration. Five key configuration factors:

Factor	Description
Min replicas	Minimum running replicas regardless of traffic
Max replicas	Maximum replicas during high traffic
Autoscaling window	Sliding timeframe for measuring traffic
Scale down delay	Wait time before scaling down
Concurrency target	Requests each replica handles at once
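
A minimal sketch of how the replica bounds and concurrency target combine, assuming load and concurrency target are expressed in the same unit (as the simulator in this section does). The function names are hypothetical, and the autoscaling window and scale down delay are deliberately left out.

```python
import math


def desired_replicas(load, concurrency_target, min_replicas, max_replicas):
    """Replicas needed to serve `load`, clamped to the configured bounds."""
    needed = math.ceil(load / concurrency_target)
    return max(min_replicas, min(max_replicas, needed))


def utilization(load, replicas, concurrency_target):
    """Fraction of total capacity in use (0.0 when there are no replicas)."""
    capacity = replicas * concurrency_target
    return load / capacity if capacity else 0.0


# Simulator example: 50 req/s, target 8 per replica, 1 to 10 replicas.
replicas = desired_replicas(load=50, concurrency_target=8, min_replicas=1, max_replicas=10)
print(replicas, round(utilization(50, replicas, 8), 2))  # 7 0.89
```

A real autoscaler would average load over the autoscaling window and hold replicas for the scale down delay before removing them, which smooths out the step changes this function produces.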

Autoscaling Simulator (interactive widget)

Adjust traffic and configuration to see how autoscaling responds. Example state: at 50 req/s with a concurrency target of 8 per replica, min replicas 1, and max replicas 10, the simulator runs 7 active replicas for a total capacity of 56 req/s, 89% utilization, status Balanced.

Concurrency and Batch Sizing

Three batching approaches in order of sophistication:

Static batching: Wait until batch is full
Dynamic batching: Wait until full OR timeout
Continuous batching: Token-level interleaving (the standard for modern engines)

Continuous Batching

Requests are interleaved at the token level. As requests finish, new ones fill their slots immediately.

[Interactive diagram: three batch slots (Slot 0, Slot 1, Slot 2), initially empty. Legend: Req A (active), Req B (active), Req C (waiting), Req D (waiting), Req E (waiting).]

Unlike static batching, which holds every slot until the whole batch has finished, continuous batching refills freed slots immediately, which maximizes GPU utilization.

Batch sizing trades latency for throughput. Test multiple sizes to find the right fit.
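
The difference between static and continuous batching can be sketched as a toy simulation that counts decode steps. It assumes all requests are already queued, each active request emits one token per step, and the output lengths are made up; real engines also handle prefill, memory limits, and preemption, none of which is modeled here.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave.
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [8, 2, 2, 2, 2]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, slots=3))      # 10
print(continuous_batching_steps(lengths, slots=3))  # 8
```

In this example the short requests finish early, and continuous batching uses their slots for the waiting requests while the long request is still decoding.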

Cold Starts

Cold Start Timeline (interactive widget)

Adjust each phase to see total cold start time. Optimize the longest phase for the biggest improvement. Example values:

Phase	Duration
GPU Provisioning	30 s
Image Loading	15 s
Model Loading	45 s
Engine Startup	60 s
Total Cold Start	150 s (2.5 min)

Engine Startup is 40% of cold start: inference engine compilation.

Model Loading is 30% of cold start: model weights written to VRAM.

Cold start time = GPU provisioning + image loading + model loading + engine startup. Optimizations:

Smaller images: Include only necessary dependencies
Quantized weights: Smaller files load faster
Cached engines: Pre-compile TensorRT-LLM engines and cache them
Local weight storage: Load from within the same datacenter, not remote S3
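
The cold start sum can be checked with a few lines of arithmetic. This sketch uses the example phase durations from the timeline in this section; the function name `cold_start_breakdown` is hypothetical.

```python
def cold_start_breakdown(phases):
    """Return total seconds and each phase's share of the total."""
    total = sum(phases.values())
    shares = {name: seconds / total for name, seconds in phases.items()}
    return total, shares


phases = {
    "GPU provisioning": 30,
    "Image loading": 15,
    "Model loading": 45,
    "Engine startup": 60,
}

total, shares = cold_start_breakdown(phases)
print(f"Total cold start: {total} s ({total / 60:.1f} min)")
# List phases from longest to shortest, since the longest is the best target.
for name in sorted(shares, key=shares.get, reverse=True):
    print(f"{name}: {phases[name]} s ({shares[name]:.0%})")
```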

Routing, Load Balancing, and Queueing

Routers decide where a request should go; load balancers decide where it could go. Intelligent routing uses:

KV cache-aware routing: Send to replicas with matching prefix
LoRA-aware routing: Send to replicas with desired fine-tune weights
Queue management: First-in-first-out with optional priority queuing
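
One way to picture KV cache-aware routing is a scoring function that prefers the replica with the longest cached prefix match and falls back to queue depth. The sketch below is an assumption about how such a router could work, not a description of any particular engine; the replica dictionary layout and names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt_tokens, replicas):
    """Prefer the longest cached-prefix match; break ties by shortest queue.

    `replicas` maps a replica name to a dict with a 'cached_prefixes' list
    and a 'queue_depth' int. This structure is hypothetical.
    """
    def score(name):
        info = replicas[name]
        best = max(
            (shared_prefix_len(prompt_tokens, p) for p in info["cached_prefixes"]),
            default=0,
        )
        return (-best, info["queue_depth"], name)

    return min(replicas, key=score)


replicas = {
    "replica-a": {"cached_prefixes": [[1, 2, 3, 4]], "queue_depth": 5},
    "replica-b": {"cached_prefixes": [[1, 2, 9]], "queue_depth": 0},
}
print(pick_replica([1, 2, 3, 7], replicas))  # replica-a (3 matching tokens)
print(pick_replica([5, 6], replicas))        # replica-b (no match, shorter queue)
```

A production router would likely weigh prefix length against queue depth rather than ranking strictly by prefix, since a long match on an overloaded replica can still be the slower choice.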

Scale to Zero

Scale down to zero replicas when idle, scale up when traffic arrives. Requires fast cold starts and robust queueing. Best for development and periodic batch workloads, not latency-sensitive production.

Multi-Cloud Capacity Management#

High-volume deployments need thousands of GPUs distributed globally. True multi-cloud requires treating distinct pools as fungible compute:

Control plane: Global model deployment and scaling decisions
Workload planes: Direct inference traffic and in-cluster scaling

Benefits: capacity pooling, redundancy, low latency from serving users out of nearby regions (roughly 5 ms of network latency per time zone crossed), and compliance.

GPU Procurement

Type	Description
Reserved	Hundreds/thousands of GPUs for months/years at discount
On-demand	Individual instances as needed at higher cost
Spot	Discounted instances that can be pre-empted

Large-scale inference uses a blend: reserved baseline + on-demand/spot for peaks.

Building for Reliability

GPUs fail. Meta's Llama 3 training saw 419 unexpected interruptions in 54 days across roughly 16,000 GPUs, or about one failure per 50,000 GPU-hours.
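
The GPU-hours figure follows directly from the numbers quoted above. The 1,000-GPU cluster at the end is a hypothetical example added to show how the rate scales, and it assumes failures are spread evenly across GPUs.

```python
gpus = 16_000
days = 54
interruptions = 419

gpu_hours = gpus * days * 24
hours_per_interruption = gpu_hours / interruptions
print(f"{gpu_hours:,} GPU-hours, about {hours_per_interruption:,.0f} per interruption")

# Hypothetical smaller cluster: expected wall-clock hours between failures.
small_cluster_gpus = 1_000
print(f"{hours_per_interruption / small_cluster_gpus:.1f} hours between failures")
```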

Two high-availability postures:

Active-active: Multiple regions serve traffic simultaneously; seamless failover
Active-passive: Hot standby cluster ready to take over

Key Takeaway

Inference engineers should expect hardware failure. Proactively detecting failures, cordoning nodes, and cycling pods keeps clusters healthy. Multi-cloud active-active postures provide the highest reliability.

Testing and Deployment#

Zero-Downtime Deployment

Two strategies:

Blue-green: Two parallel environments; shift traffic between them
Canary: Start with small share of traffic to new deployment; gradually increase

Shadow traffic — mirroring real requests to a candidate deployment — validates performance without affecting users.
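
A canary split can be sketched as a deterministic hash of a request identifier, so the same request always lands on the same side. This is one possible approach under stated assumptions: the `route` function, the `req-N` identifiers, and the 5% share are all hypothetical, and real deployments usually configure the split in the load balancer or service mesh.

```python
import hashlib


def route(request_id, canary_percent):
    """Deterministically send roughly `canary_percent`% of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


counts = {"canary": 0, "stable": 0}
for i in range(10_000):
    counts[route(f"req-{i}", canary_percent=5)] += 1
print(counts)
```

Raising `canary_percent` step by step gives the gradual increase described above; setting it to 0 or 100 corresponds to the two ends of a blue-green switch.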

Cost Estimation

Cost per token depends on: GPU cost, utilization, throughput, and precision. Track cost metrics alongside latency metrics for informed optimization decisions.
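
A back-of-the-envelope version of this relationship, assuming a single GPU and treating precision as already reflected in the measured throughput. The function name and all input numbers are hypothetical.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second, utilization):
    """Dollars per one million generated tokens for a single GPU."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: a $2.50/hour GPU, 1,000 tokens/s at peak, 60% utilized.
print(round(cost_per_million_tokens(2.50, 1000, 0.60), 2))  # 1.16
```

Because utilization sits in the denominator, idle GPUs raise cost per token directly, which is why the autoscaling settings matter as much as raw throughput.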

Observability

Monitor: GPU utilization, VRAM usage, request latency (P50/P90/P99), queue depth, throughput, error rates, and model quality metrics. Alert on deviations from baselines.
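
Latency percentiles can be computed from raw samples with the nearest-rank method, sketched below using only the standard library. The latency values are made up; a monitoring stack would normally compute these from histograms rather than from raw lists.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [120, 95, 110, 400, 105, 98, 102, 130, 99, 1500]  # made-up samples
for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
# P50: 105 ms
# P90: 400 ms
# P99: 1500 ms
```

The gap between P50 and P99 in this sample shows why averages alone are not enough: a few slow requests dominate the tail while leaving the median almost unchanged.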

Client Code#

Client Latency Overhead

Network latency adds to inference time. Minimize with: regional deployment, connection pooling, and keep-alive connections.

Asynchronous Inference

For non-real-time workloads, submit requests to a queue and poll for results. Decouples request submission from response processing.
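
The submit-and-poll pattern can be sketched with an in-memory queue and a worker thread. Everything here is a stand-in: `fake_model` replaces a real inference call, and a real system would use a durable queue and a result store instead of process-local objects.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}  # job_id -> result, filled in by the worker
results_lock = threading.Lock()


def fake_model(prompt):
    """Stand-in for a real inference call."""
    return prompt.upper()


def submit(prompt):
    """Enqueue a request and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, prompt))
    return job_id


def poll(job_id):
    """Return the result if it is ready, otherwise None."""
    with results_lock:
        return results.get(job_id)


def worker():
    """Process queued jobs one at a time and record their results."""
    while True:
        job_id, prompt = jobs.get()
        output = fake_model(prompt)
        with results_lock:
            results[job_id] = output
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

job = submit("hello")
jobs.join()  # a real client would poll in a loop with a delay instead
print(poll(job))  # HELLO
```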

Streaming and Protocol Support

Protocol	Best For
HTTP/REST	Simple request-response (most common)
SSE (Server-Sent Events)	LLM token streaming
WebSockets	Bidirectional streaming (voice, real-time)
gRPC	High-performance, schema-first APIs

Key Takeaway

For LLM inference, SSE streaming is the standard. For voice applications (ASR/TTS), WebSockets enable the lowest-latency bidirectional streaming. Choose protocols based on your application's real-time requirements.
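
On the client side, consuming an SSE stream comes down to reading `data:` lines and treating a blank line as the end of an event. The parser below is a minimal sketch that handles only `data:` fields and comment lines. The `[DONE]` sentinel is an assumption borrowed from a common API convention, not part of the SSE format itself, and the sample stream is made up.

```python
def parse_sse(lines):
    """Yield the data payload of each complete event from an iterable of SSE lines."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line ends the current event
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):  # comment or keep-alive line
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the payload.
            data.append(value[1:] if value.startswith(" ") else value)


stream = [
    ": keep-alive\n",
    "data: Hel\n",
    "\n",
    "data: lo\n",
    "\n",
    "data: [DONE]\n",
    "\n",
]

tokens = []
for payload in parse_sse(stream):
    if payload == "[DONE]":  # assumed end-of-stream sentinel
        break
    tokens.append(payload)
print("".join(tokens))  # Hello
```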

Try it: Model-to-Hardware Recommender →

Input your model, throughput target, and latency SLO — get ranked GPU configurations with reasoning, ready for your deployment planning.

Try it: VRAM Calculator →

Validate your capacity plan: compute the exact VRAM budget for your model at production batch sizes before you scale.
