Chapter 4

Software

From CUDA to Inference Engines

25 min read · 5 sections

By the end of this chapter you will be able to

  1. Describe the full software stack from CUDA runtime to application API
  2. Compare vLLM, SGLang, and TensorRT-LLM on performance, ease of use, and hardware support
  3. Explain what continuous batching does and why it matters for throughput
  4. Design a benchmarking protocol that produces actionable, reproducible results

💡 Looking for a quick inference engine comparison?

This chapter covers the full software stack from CUDA to inference engines to NVIDIA Dynamo. For a scannable side-by-side decision guide on vLLM, SGLang, and TensorRT-LLM with benchmark data, see vLLM vs SGLang vs TensorRT-LLM: Which Inference Engine to Choose.

The software stack for inference has four layers of abstraction:

  • CUDA: Direct communication to the GPU for explicit control over computations and memory
  • Deep learning frameworks: Abstractions over CUDA for training, exporting, and running neural networks in Python
  • Inference engines: Highly configurable PyTorch-backed inference for common architectures
  • NVIDIA Dynamo: Sits on top of inference engines to power large-scale deployments

Most inference engineering today happens at the higher levels, configuring and deploying inference engines. No matter what level you work at, it's essential to have a strong mental model for the adjacent levels.

CUDA

📖 CUDA

CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary computing platform and programming model for executing parallel tasks on GPUs. It's the foundation for the entire generative AI ecosystem on NVIDIA hardware.

CUDA has four key components:

  • CUDA kernel: A function that executes parallelized code on the GPU
  • CUDA graph: A DAG of kernels and GPU operations for optimizing repeated workflows
  • CUDA driver: Low-level interface between the application and GPU hardware
  • CUDA runtime: Developer-facing API for launching kernels and managing memory

CUDA is not a programming language: programs are written in C++, then compiled into separate CPU and GPU code by a compiler like nvcc.

Writing CUDA kernels shifts inference engineering from thinking about algorithms to thinking about implementations. The traditional attention algorithm can be expressed in a few dozen lines, but FlashAttention, which computes the same mathematical operation, takes tens of thousands of lines to implement efficiently for a specific GPU.
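To make the "few dozen lines" side of that comparison concrete, here is a minimal sketch of the traditional attention algorithm in plain Python. It is illustrative only: it assumes a single head, no masking, and list-of-lists inputs, and the function names are ours rather than any library's.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def naive_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, and V are lists of rows (seq_len x head_dim). Every query
    computes a full row of scores against every key, which is the
    intermediate data that fused kernels avoid writing to memory.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        # One score per key for this query.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

The algorithm fits on one screen. The engineering effort in a production kernel goes into tiling, memory layout, and hardware-specific instructions, none of which change the math above.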

CUDA Kernels for Inference

The prior art for these kernels predates CUDA by decades. BLAS (Basic Linear Algebra Subprograms) is a specification for common linear algebra operations that dates back to the 1970s. cuBLAS is NVIDIA's implementation of it for GPUs, and cuDNN provides neural network primitives.

The most frequently used operation is GEMM (General Matrix-Matrix Multiplication). Key libraries:

| Library | Purpose |
| --- | --- |
| cuBLAS | Pre-built kernels for essential linear algebra |
| CUTLASS | Template library for writing high-performance kernels (used by FlashAttention 3) |
| CuTe | Abstractions for tiled tensor operations on recent architectures |
| FlashInfer | High-performance attention kernels and fused sampling functions |
| DeepGEMM | Efficient FP8 GEMM kernels from the DeepSeek team |
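For reference, the operation all of these libraries optimize is small to state. The sketch below is a deliberately naive pure-Python GEMM in the BLAS form C = alpha * A * B + beta * C. The helper names are ours, and nothing here reflects how the libraries above are implemented.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def gemm_flops(m, n, k):
    """Multiply-add count for an (m x k) by (k x n) product: 2 * m * n * k."""
    return 2 * m * n * k
```

Each of the m * n outputs needs k multiplies and k adds, so the arithmetic grows with the product of all three dimensions. That is why GEMM dominates the cost of transformer layers.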

CUDA Kernel Selection

Kernel implementations are highly specialized, with hard-coded values based on specific GPU hardware. A kernel tuned for an H100 will likely not take advantage of the features added in the B200 architecture. Most kernel selection is automatic: deep learning frameworks ship pre-configured kernels for various architectures.
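One way to picture automatic selection is as a lookup keyed by the GPU's compute capability. The sketch below is hypothetical: the registry, the variant names, and the fallback rule are invented for illustration and do not describe any framework's actual dispatch logic. The capability tuples (8, 0) for A100, (9, 0) for H100, and (10, 0) for B200 are the only real values in it.

```python
# Hypothetical registry: (major, minor) compute capability -> kernel variant.
KERNEL_VARIANTS = {
    (8, 0): "gemm_ampere",   # tuned for A100-class GPUs
    (9, 0): "gemm_hopper",   # tuned for H100-class GPUs
}
GENERIC_FALLBACK = "gemm_generic"


def select_kernel(device_capability):
    """Pick the newest registered variant that does not exceed the device."""
    candidates = [cc for cc in KERNEL_VARIANTS if cc <= device_capability]
    if not candidates:
        return GENERIC_FALLBACK
    return KERNEL_VARIANTS[max(candidates)]
```

Under these assumptions, `select_kernel((10, 0))` returns the Hopper-tuned variant on a B200. The code runs, but it does not use anything new in the hardware until someone registers a variant written for it.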

Reducing Memory Accesses with Kernel Fusion

Running two kernels back-to-back on the same data creates unnecessary round-trips to memory. Kernel fusion combines multiple kernels into a single kernel that handles both operations, eliminating intermediate reads and writes.
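The saving can be counted directly. The following sketch stands in for GPU global memory with a buffer that counts element reads and writes, then runs a scale followed by a ReLU both as two separate passes and as one fused pass. The class and function names are made up for this example, and real kernels move data in wide, coalesced transactions rather than one element at a time.

```python
class CountingBuffer:
    """A list that counts element loads and stores, standing in for global memory."""

    def __init__(self, values):
        self.values = list(values)
        self.reads = 0
        self.writes = 0

    def load(self, i):
        self.reads += 1
        return self.values[i]

    def store(self, i, x):
        self.writes += 1
        self.values[i] = x


def unfused_scale_then_relu(data, scale):
    """Two 'kernels': the intermediate result makes a round trip through memory."""
    src = CountingBuffer(data)
    tmp = CountingBuffer([0.0] * len(data))
    out = CountingBuffer([0.0] * len(data))
    for i in range(len(data)):          # kernel 1: scale
        tmp.store(i, src.load(i) * scale)
    for i in range(len(data)):          # kernel 2: ReLU
        out.store(i, max(tmp.load(i), 0.0))
    traffic = src.reads + tmp.writes + tmp.reads + out.writes
    return out.values, traffic


def fused_scale_relu(data, scale):
    """One 'kernel': the scaled value never leaves registers."""
    src = CountingBuffer(data)
    out = CountingBuffer([0.0] * len(data))
    for i in range(len(data)):
        out.store(i, max(src.load(i) * scale, 0.0))
    traffic = src.reads + out.writes
    return out.values, traffic
```

For n elements the unfused version performs 4n memory accesses and the fused version 2n, with identical results. In a bandwidth-bound workload, that halving translates almost directly into time saved.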

Key Takeaway

During decode (the bandwidth-bound phase), an inference engine can't afford unnecessary memory accesses. Kernel fusion, both automatic (via compilers) and manual (as in FlashAttention), is essential for performance.

Deep Learning Frameworks

PyTorch is the industry-standard framework underlying both training and inference. It was originally created at Meta and is now governed by the PyTorch Foundation, part of the Linux Foundation.

PyTorch balances built-in functions and automatic optimizations with manual control. The step that prepares a trained model for efficient inference is compilation (for example, with torch.compile), which performs automatic kernel selection and fusion for a specific GPU.

Model File Formats

  • safetensors: The dominant format for serializing model weights. Uses memory mapping for fast, safe loading. Only holds tensor data, not executable code.
  • ONNX: Stores weights along with an execution graph. Highly portable across hardware.
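The "tensor data only" property of safetensors is visible in the file layout itself. The sketch below parses a header using only the standard library. It assumes the published layout as we understand it: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and byte offsets, then raw tensor bytes. The function names are ours, and real code should load files with the safetensors library instead.

```python
import json
import struct


def read_safetensors_header(blob):
    """Parse the JSON header of a safetensors byte string.

    Assumed layout: 8-byte little-endian unsigned header length, that
    many bytes of UTF-8 JSON, then the raw tensor bytes.
    """
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data_start = 8 + header_len
    return header, data_start


def build_example_blob():
    """Build a tiny in-memory file holding one float32 tensor of shape [2]."""
    payload = struct.pack("<2f", 1.5, -2.0)
    header = {"w": {"dtype": "F32", "shape": [2],
                    "data_offsets": [0, len(payload)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


blob = build_example_blob()
header, data_start = read_safetensors_header(blob)
begin, end = header["w"]["data_offsets"]
# Offsets are relative to the start of the data section, after the header.
values = struct.unpack("<2f", blob[data_start + begin:data_start + end])
```

Nothing in the file is ever executed: the loader reads JSON metadata and slices bytes. Because each tensor is a contiguous byte range at a known offset, the data section can also be memory-mapped.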

ONNX Runtime and TensorRT

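|  | ONNX Runtime | TensorRT |
| --- | --- | --- |
| Source | Open source (maintained by Microsoft; the ONNX format is a Linux Foundation project) | Mix of proprietary and open (NVIDIA) |
| Hardware | Many GPU types | NVIDIA only |
| Strength | Portability | Raw performance |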

Transformers and Diffusers

The transformers and diffusers libraries by Hugging Face offer reference implementations: great for learning and tinkering, but not designed for large-scale production inference. Use them for local inference and notebooks, then switch to production inference engines.

Inference Engines

Three engines lead the field: vLLM, SGLang, and TensorRT-LLM.

|  | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- |
| Performance | Good | Good | Best |
| Ease of use | Easy | Easy | Hard |
| Model support | Most | Most | Some |
| Hardware | GPU, TPU | NVIDIA, AMD | NVIDIA only |

All three support continuous batching, post-training quantization, speculative decoding, prefix caching, parallelism, and disaggregation out of the box.
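Of these features, continuous batching has the most direct effect on throughput. The toy simulation below counts decode steps under two scheduling policies. It is a sketch under loud assumptions: every active request produces one token per step, prefill is ignored, the batch has a fixed number of slots, and every request generates at least one token. The function names are ours, not any engine's API.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch holds the GPU until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """After every decode step, finished requests leave and queued ones join."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these lengths, static batching takes 10 steps. The short request in the first batch idles for 6 steps while the long one finishes. Continuous batching takes 8, because the freed slot is refilled immediately. The gap widens as output lengths become more varied, which is typical of real traffic.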

vLLM

vLLM has the largest market share and was first released in summer 2023. Its best selling point is broad support: the most hardware options, the most model architectures, and multimodal inference via vLLM Omni. Use it when you want solid out-of-the-box performance for almost any model.

SGLang

SGLang pairs a fast backend runtime with a flexible frontend language. It has strong support for Chinese open models (DeepSeek, Qwen) and invests heavily in large-scale MoE deployments on systems like GB200 NVL72.

TensorRT-LLM

TensorRT-LLM has a steeper learning curve but usually delivers the best performance, thanks to deep NVIDIA hardware integration and graph-level optimizations. At Baseten, it's the most-used engine.

Inference Orchestration

Once you scale beyond a single replica, you need an orchestration layer above the inference engine to route requests, manage KV cache across nodes, and coordinate disaggregated prefill/decode workers. Inference engines (vLLM, SGLang, TensorRT-LLM) run a model on a node; orchestration frameworks coordinate many of them.

NVIDIA Dynamo is the leading open-source orchestration framework for vLLM and other engines. It's a distributed serving platform that sits on top of inference engines and manages:

  • KV-cache-aware routing: send requests to the replica that already holds the matching prefix, maximizing cache hits and lowering TTFT
  • Disaggregated serving: coordinate separate prefill and decode worker pools, scaling each independently
  • Multi-GPU / multi-node orchestration: the coordination layer for large-scale, multi-replica deployments
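The routing idea in the first bullet can be sketched in a few lines. This is not Dynamo's API. The data structures, the function names, and the tie-breaking rule (prefer the less loaded replica) are all assumptions made for illustration, and a real router would track cached blocks rather than whole token sequences.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt_tokens, replica_caches, replica_load):
    """Pick the replica with the longest cached prefix; break ties by load.

    replica_caches maps replica name -> list of cached token sequences.
    replica_load maps replica name -> number of in-flight requests.
    """
    def score(name):
        best = max(
            (shared_prefix_len(prompt_tokens, cached)
             for cached in replica_caches[name]),
            default=0,
        )
        # Longer prefix wins; among equals, lower load wins.
        return (best, -replica_load[name])

    return max(replica_caches, key=score)


caches = {"replica-a": [[1, 2, 3, 4]], "replica-b": [[1, 2, 9]]}
load = {"replica-a": 5, "replica-b": 0}
```

With this state, the prompt `[1, 2, 3, 7]` goes to `replica-a` even though it is busier, because three of its tokens are already cached there. A prompt with no cached prefix anywhere, such as `[5, 6]`, falls back to the less loaded `replica-b`.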

Other building blocks in this layer include router/proxy tools like LiteLLM and nginx for application-level routing, and Kubernetes operators for autoscaling. The distinction to remember: the inference engine runs the model; the orchestration framework coordinates fleets of engines.

Performance Benchmarking

⚠️ Benchmark Carefully

Benchmarks should mirror real-world usage as closely as possible. Use jitter traffic (randomized arrival times and sequence shapes) rather than uniform synthetic loads. Always measure P50, P90, and P99 latencies.
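A minimal sketch of both ideas follows, using only the standard library. It assumes Poisson arrivals (exponentially distributed gaps between requests) as the model for randomized arrival times, and a nearest-rank definition of percentiles. Both are common choices but not the only ones, and the function names are ours.

```python
import math
import random


def jittered_arrivals(rate_per_s, n, seed=0):
    """Return n arrival times (seconds) with exponentially distributed gaps."""
    rng = random.Random(seed)  # seeded so a benchmark run can be reproduced
    t = 0.0
    times = []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)
        times.append(t)
    return times


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies):
    """P50, P90, and P99 of a list of per-request latencies."""
    return {f"p{p}": percentile(latencies, p) for p in (50, 90, 99)}
```

Seeding the generator keeps the load pattern irregular but repeatable, so two engine configurations can be compared against the same traffic. Reporting P99 alongside P50 exposes tail latency that an average would hide.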

Key benchmarking tools: genai-perf (NVIDIA) and the benchmark scripts that ship with vLLM and SGLang. Profiling tools: PyTorch Profiler, NVIDIA Nsight Systems, and Nsight Compute.

Tips for useful benchmarking:

  1. Define realistic workloads: Match your actual ISL/OSL distributions
  2. Warm up first: Let the engine stabilize before measuring
  3. Test multiple configurations: Vary batch size, parallelism, quantization
  4. Measure what matters: TTFT and TPS for user-facing, total throughput for batch
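The metrics in the last tip can be computed from per-token timestamps recorded by a streaming client. The sketch below assumes one particular definition of per-request TPS, namely output tokens after the first divided by the time spent decoding them. Tools differ on this, so check what yours reports. The function names are ours.

```python
def request_metrics(sent_at, token_times):
    """Return (TTFT, decode TPS) for one streamed response.

    sent_at is when the request was sent; token_times holds the arrival
    time of each output token, in seconds on the same clock.
    """
    ttft = token_times[0] - sent_at
    decode_time = token_times[-1] - token_times[0]
    tokens_after_first = len(token_times) - 1
    # A single-token response has no decode phase to measure.
    tps = tokens_after_first / decode_time if decode_time > 0 else float("nan")
    return ttft, tps


def total_throughput(tokens_per_request, wall_seconds):
    """Aggregate output tokens per second across all requests in a run."""
    return sum(tokens_per_request) / wall_seconds
```

TTFT and per-request TPS describe what a single user experiences, while total throughput describes how much work the deployment completes. A configuration tuned for one often trades away the other, which is why the tip separates user-facing and batch workloads.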

