vLLM vs SGLang vs TensorRT-LLM

9 min read · Updated 2026-06-01

💡This is the quick decision guide

This article is a scannable comparison with a decision framework. For the full deep-dive on the inference software stack — from CUDA kernels to NVIDIA Dynamo — read Chapter 4: vLLM, SGLang & TensorRT-LLM in Context.

TL;DR

vLLM, SGLang, and TensorRT-LLM all implement continuous batching, paged KV caching, and FP8 quantization — the surface-level feature gap has largely closed. What separates them is architecture: vLLM offers the broadest hardware and model coverage with predictable performance; SGLang's RadixAttention and multi-call scheduling win on high-concurrency MoE workloads; TensorRT-LLM's compiled engine path delivers the highest raw throughput on NVIDIA hardware at the cost of significant operational complexity. For most teams, start with vLLM and migrate to the others only when a profiled bottleneck justifies it.

Key Facts

Best raw NVIDIA throughput: TensorRT-LLM (15–25% above vLLM on H100 for dense models)
Easiest production setup: vLLM — pip install, one CLI flag, 400+ model architectures
Best for large MoE / high concurrency: SGLang — RadixAttention + EP gives best DeepSeek-R1/V3 perf
Broadest hardware support: vLLM — NVIDIA, AMD ROCm, Google TPU, Intel Gaudi, CPU fallback
Best structured-output throughput: SGLang — constrained decoding integrated at the scheduler level
Production maturity (2026): All three are production-grade; TensorRT-LLM is the most common choice at large-scale providers such as Baseten

Choosing an inference engine is one of the highest-leverage decisions in an LLM deployment. The three leading open-source options — vLLM, SGLang, and TensorRT-LLM — all serve the same purpose but make radically different architectural bets. A wrong choice costs you either performance headroom (staying on a slow engine when you've hit scale) or weeks of engineering time (building on a complex engine before you need the performance). This guide cuts through the marketing to explain the architectures, the real performance tradeoffs, and a decision framework that works in practice.

Before the differences, it's worth establishing the baseline. Any modern serving engine you pick in 2026 will include all of these:

Continuous (in-flight) batching: new requests join the running batch the moment a sequence slot opens, instead of waiting for the full batch to drain. This is the single biggest throughput unlock over naïve batching.
Paged KV cache: virtual-memory-style management for the attention key/value store. Rather than pre-allocating contiguous VRAM per request, pages are assigned on demand and reclaimed when a request finishes, which nearly eliminates fragmentation and often roughly doubles the number of concurrent sequences versus static allocation.
Prefix caching: KV pages for a shared prompt prefix (system prompt, few-shot block, conversation history) are retained and reused across requests, so a warm prefix skips its prefill and time-to-first-token (TTFT) shrinks to the cost of prefilling only the uncached suffix.
Quantization: FP8, INT8, and INT4 formats are supported in all three engines, with negligible quality loss at FP8 and typically 1–3% at aggressive INT4.
Speculative decoding: a draft model proposes tokens and the target model verifies them in a single pass, reducing latency at low-to-medium batch sizes.
Tensor and pipeline parallelism: multi-GPU sharding strategies for models that exceed single-GPU VRAM.
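To make the continuous batching item concrete, here is a minimal toy simulation that counts decode steps under static and continuous batching. It is a sketch under simplifying assumptions (one token per sequence per step, no prefill cost, hypothetical output lengths), not how any of the three engines implements its scheduler.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully drain before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])  # batch waits for its longest request
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a waiting request takes a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Fill free slots before every step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active sequence emits one token; finished ones release their slot.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # hypothetical output lengths
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
# prints: 200 110
```

In this toy case the short requests no longer wait behind the long ones, so the same work finishes in 110 steps instead of 200.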

So the choice is rarely about features — the feature gap closed in 2024–2025. It's about architecture, performance ceiling, operational complexity, and ecosystem fit.

Architecture deep-dive

vLLM: PagedAttention and the Python-first stack

vLLM originated with the paper that introduced PagedAttention (Kwon et al., 2023) and largely defined the modern inference engine category. Its architecture has three key layers:

Scheduler — a Python-level scheduler maintains a priority queue of waiting/running/swapped requests. It allocates KV cache pages from a central block table, handles preemption (swapping sequences to CPU when VRAM pressure spikes), and assembles micro-batches for each forward pass.
Attention backend — vLLM has pluggable attention backends (FlashAttention-2, FlashInfer, xFormers). The default FlashInfer backend in recent versions handles both prefill and paged decode efficiently, with fused CUDA kernels that avoid materializing the full attention matrix.
Worker processes — each GPU runs a worker process that executes the forward pass via PyTorch. Tensor parallelism is implemented via NCCL all-reduce across workers. The Python overhead of the scheduler is hidden by the async engine, which overlaps scheduling with GPU execution.

PagedAttention's core insight: treat the KV cache as a page table analogous to OS virtual memory. A 2048-token context doesn't need a single 2048-slot contiguous buffer; it is allocated 16-token (or configurable) pages on demand. In the original paper, this cut KV cache memory waste from the 60–80% measured in earlier systems to under 4%, which lets you pack significantly more concurrent sequences into the same VRAM.
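The page-table idea can be sketched as a toy allocator. This is an illustration only: the class and method names are hypothetical, it tracks page ids rather than real GPU memory, and it omits the preemption and swapping a real scheduler performs.

```python
import math

PAGE_TOKENS = 16  # page size from the text; configurable in practice

class PagedKVAllocator:
    """Toy block table: maps each sequence to a list of physical page ids."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.block_table = {}  # seq_id -> [page ids]
        self.seq_len = {}      # seq_id -> tokens stored

    def append_tokens(self, seq_id, n):
        """Grow a sequence by n tokens, allocating pages only when needed."""
        pages = self.block_table.setdefault(seq_id, [])
        new_len = self.seq_len.get(seq_id, 0) + n
        needed = math.ceil(new_len / PAGE_TOKENS) - len(pages)
        if needed > len(self.free_pages):
            raise MemoryError("out of KV pages; a real scheduler would preempt")
        for _ in range(needed):
            pages.append(self.free_pages.pop())
        self.seq_len[seq_id] = new_len

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.block_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

    def wasted_fraction(self):
        """Share of allocated slots holding no token (internal fragmentation)."""
        allocated = sum(len(p) for p in self.block_table.values()) * PAGE_TOKENS
        used = sum(self.seq_len.values())
        return 0.0 if allocated == 0 else 1 - used / allocated

alloc = PagedKVAllocator(num_pages=256)
alloc.append_tokens("a", 100)   # 7 pages: the last page is only partly filled
alloc.append_tokens("b", 2048)  # 128 pages, exactly full
print(f"{alloc.wasted_fraction():.2%}")
```

Waste is confined to the last, partially filled page of each sequence, which is why the overall fraction stays small.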

Performance characteristics: vLLM delivers excellent throughput across a broad range of batch sizes. Its main ceiling is that the Python scheduler adds per-step overhead (typically 200–500 μs per scheduling cycle at high concurrency), and the PyTorch execution path has more overhead than a fully compiled CUDA graph path.

Model support: 400+ architecture families registered in vLLM's model registry. Adding a new HuggingFace model is typically a 50–200-line Python class. This is the main reason vLLM has the broadest model coverage.

Hardware support: NVIDIA (CUDA), AMD (ROCm), Google TPU (XLA), Intel Gaudi (HPU), and CPU fallback. The AMD path is particularly mature — vLLM runs well on MI300X.

SGLang: RadixAttention and the structured-generation runtime

SGLang (Zheng et al., 2024) was designed with two goals that aren't typical in a pure serving engine: (1) a structured generation language for expressing multi-turn, conditional, and parallel generation programs, and (2) a high-performance runtime that makes those programs efficient.

The architectural differentiator is RadixAttention — a KV cache management strategy that goes beyond flat prefix caching. While vLLM's prefix cache handles a single shared prefix per request, RadixAttention builds a radix tree (trie) over all cached KV pages across all requests. When a new request arrives, the engine finds the longest common prefix path in the trie — across all previously cached sequences, not just a system prompt — and reuses those pages. For workloads like few-shot prompting with many examples, this can eliminate 70–90% of prefill computation.
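A minimal sketch of the longest-prefix lookup described above. It assumes a plain token-level trie with hypothetical token ids; a real radix tree compresses single-child chains and stores references to KV pages, and this is not SGLang's actual data structure.

```python
class PrefixTrie:
    """Token-level trie over previously seen sequences."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a sequence whose KV entries are assumed to be cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            n += 1
        return n

cache = PrefixTrie()
system = [1, 2, 3, 4]                # hypothetical token ids for a system prompt
cache.insert(system + [10, 11, 12])  # request A
cache.insert(system + [10, 20])      # request B shares the prompt and one more token

new_request = system + [10, 11, 99]
hit = cache.match_len(new_request)
print(hit, len(new_request) - hit)   # cached tokens, tokens still needing prefill
# prints: 6 1
```

The match extends past the system prompt into request A's tokens, which is the difference from caching a single fixed prefix.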

The frontend language (the "SGL" in SGLang) lets you express patterns like:

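import sglang as sgl

@sgl.function
def multi_choice_qa(s, question, choices):
    # Shared prefix: every branch reuses the KV cache for these messages.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question)
    # fork() returns one independent copy of the state per choice.
    forks = s.fork(len(choices))
    for i, fork in enumerate(forks):
        fork += f"The answer is {choices[i]}: "
        fork += sgl.gen("verdict", max_tokens=20)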

This program forks into N parallel generation branches with shared prefixes — exactly the pattern RadixAttention optimizes. The runtime can batch all N branches together because their KV prefixes are identical and already in the trie.

MoE and large-scale deployment: SGLang has particularly strong investment in large mixture-of-experts models. It implements expert parallelism (EP) tightly, and for models like DeepSeek-R1/V3 at full scale (671B parameters), SGLang's EP implementation on GB200 NVL72 clusters is competitive with anything else available. The Chinese ML community (where DeepSeek and Qwen originate) has heavily contributed to SGLang, so these models get first-class treatment.

Structured output: Constrained decoding (JSON schema, regex, grammar) is implemented at the scheduler level in SGLang, not as a post-hoc token filter. This means the overhead is lower and the integration is tighter than in engines that bolt on constrained decoding as a plugin.
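Constrained decoding boils down to masking out tokens the grammar forbids before a token is chosen. The sketch below assumes a tiny hypothetical vocabulary and a linear grammar in which the state is simply the step index; production engines compile JSON schemas or regexes into much richer automata, and none of these names come from SGLang.

```python
import math

VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]

# Hypothetical grammar for {"ok": true|false}, as state -> allowed next tokens.
GRAMMAR = {
    0: {"{"},
    1: {'"ok"'},
    2: {":"},
    3: {"true", "false"},
    4: {"}"},
}

def constrained_greedy(logits_per_step):
    """Pick the highest-logit token among those the grammar allows at each step."""
    out = []
    for state, logits in enumerate(logits_per_step):
        allowed = GRAMMAR[state]
        # Forbidden tokens get -inf so they can never win the argmax.
        masked = [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]
        out.append(VOCAB[masked.index(max(masked))])
    return "".join(out)

# Hypothetical model that always prefers "hello"; the mask overrides it.
logits = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0]] * 5
print(constrained_greedy(logits))
# prints: {"ok":false}
```

Doing this inside the scheduler means the mask for the next step can be prepared alongside batching decisions rather than applied as a separate filtering pass.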

Performance characteristics: At high concurrency (100+ simultaneous requests), SGLang's scheduler overhead is lower than vLLM's because more work happens in C++/CUDA rather than Python. The RadixAttention trie lookup adds some overhead per-request but pays for itself many times over when hit rates are high. On H100 benchmarks with 5,980 requests at TP=2, SGLang throughput is competitive with or exceeds vLLM on high-concurrency workloads with shared prefixes.

TensorRT-LLM: the compiled engine path

TensorRT-LLM takes a fundamentally different approach: instead of a Python-based scheduler driving a PyTorch forward pass, it compiles the model into a TensorRT engine — a serialized, hardware-specific CUDA execution plan with fused operations, kernel auto-tuning, and CUDA graph capture throughout.

The pipeline is:

Model conversion — the HuggingFace or NeMo checkpoint is converted to TensorRT-LLM's weight format. This step handles quantization and weight layout transformations.
Engine build — trtllm-build compiles the model into a .engine file. This step runs kernel benchmarks on your specific GPU and sequence length profile, selecting the fastest kernel variants. Build times range from 5 minutes (small models) to 2+ hours (large models, multi-GPU).
Runtime — the compiled engine runs in TensorRT-LLM's C++ runtime with minimal Python overhead. The runtime implements in-flight batching, paged KV cache, and speculative decoding in C++/CUDA.

Why it's faster: Three reasons. First, operator fusion is more aggressive — attention, layer norm, and linear layers are fused into single kernels calibrated to your GPU. Second, CUDA graph capture covers the entire forward pass, eliminating CPU-side kernel launch overhead (each kernel launch from Python takes ~5–10 μs; in a 32-layer model that's significant at low batch sizes). Third, auto-tuned kernel selection — TensorRT benchmarks multiple kernel implementations on your actual hardware during the build phase and picks the fastest for each shape.
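A quick back-of-the-envelope calculation shows why launch overhead matters. The 32 layers and the 5–10 μs per launch come from the text; the number of kernels per layer is an assumption chosen purely for illustration.

```python
def launch_overhead_us(layers, kernels_per_layer, launch_us):
    """CPU-side launch time per forward pass when kernels are launched one by one."""
    return layers * kernels_per_layer * launch_us

layers = 32
kernels_per_layer = 10  # assumption for illustration, not a measured value
for launch_us in (5, 10):  # per-launch range quoted in the text
    total = launch_overhead_us(layers, kernels_per_layer, launch_us)
    print(f"{launch_us} us/launch -> {total / 1000:.1f} ms per decode step")
```

Under these assumptions the launches alone cost 1.6–3.2 ms per step, which is comparable to a small-batch decode step itself. Replaying a captured CUDA graph replaces those individual launches with a single one.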

The operational cost: Every time you change a build input (model, quantization level, max sequence length, max batch size, or GPU type), you rebuild the engine. This isn't a configuration change; it's a recompilation that takes anywhere from minutes to hours depending on model size. Teams using TensorRT-LLM build CI pipelines around engine builds and version the .engine files alongside model artifacts. This is real engineering overhead.
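One way to version engine files is to derive the artifact name from the build inputs, so that any change forces a new build. The sketch below is an assumption about how such a pipeline might look: the field names, model name, and path layout are hypothetical and are not part of TensorRT-LLM.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EngineBuildSpec:
    """Build inputs named in the text; the field names are hypothetical."""
    model: str
    quantization: str
    max_seq_len: int
    max_batch_size: int
    gpu_type: str

    def cache_key(self):
        """Stable digest: any changed field yields a new key, hence a rebuild."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:16]

def engine_path(spec, root="engines"):
    """Hypothetical artifact layout: one file per unique build spec."""
    return f"{root}/{spec.model}/{spec.cache_key()}.engine"

base = EngineBuildSpec("llama-3.1-70b", "fp8", 8192, 64, "H100")
longer = EngineBuildSpec("llama-3.1-70b", "fp8", 16384, 64, "H100")
print(engine_path(base) != engine_path(longer))  # True: new max_seq_len, new engine
```

A CI job can then check whether the path already exists in artifact storage and skip the build when it does.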

Model and hardware support: TensorRT-LLM supports NVIDIA hardware only (Ampere A100, Hopper H100/H200, Blackwell B200/GB200). It supports fewer model architectures than vLLM but covers the most important ones (Llama, Mistral, Qwen, DeepSeek, GPT families, T5/encoder-decoder). Adding a new architecture requires implementing TensorRT-LLM plugin operations, which is significantly harder than vLLM's Python model class.

Head-to-head comparison

Dimension	vLLM	SGLang	TensorRT-LLM
Peak throughput (H100, dense)	Excellent	Excellent	Best (15–25% higher)
High-concurrency throughput	Very good	Best	Very good
Time-to-first-token	Good	Best (RadixAttention)	Good
Large MoE (DeepSeek-scale)	Good	Best	Good
Structured output	Good	Best (native)	Limited
Ease of setup	Easiest	Easy	Hard (engine build)
Model coverage	Broadest (400+)	Broad (150+)	Selective (~50)
Hardware support	Broadest (NVIDIA/AMD/TPU/Gaudi)	NVIDIA, AMD	NVIDIA only
Operational complexity	Low	Low	High (engine versioning)
Active development velocity	Very high	Very high	High (NVIDIA-backed)
Community size	Largest	Large	Medium

Performance numbers: what to expect

These figures are representative of H100 SXM 80 GB deployments. Your workload will differ — benchmark before committing.

Model	Engine	Config	Throughput
Llama-3.1 70B	vLLM	TP=4, FP8	~2,800 tok/s total
Llama-3.1 70B	TRT-LLM	TP=4, FP8, compiled	~3,400 tok/s total
Llama-3.1 70B	SGLang	TP=4, FP8	~2,900 tok/s total
DeepSeek-R1 671B	SGLang	EP=8, FP8	~1,100 tok/s (8× H100)
Llama-3.1 8B	vLLM	TP=1, FP8	~12,000 tok/s
Llama-3.1 8B	TRT-LLM	TP=1, FP8, compiled	~14,500 tok/s

The TRT-LLM advantage is most pronounced at high batch sizes with dense (non-MoE) models, where the compiled kernels and CUDA graph path pull ahead. At low batch sizes (latency-sensitive, single-user scenarios), the gap narrows significantly because decoding is memory-bandwidth-bound and every engine runs into the same VRAM bandwidth ceiling.
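The bandwidth ceiling can be estimated with simple arithmetic: at batch size 1, every generated token requires reading all the weights once. The sketch below assumes FP8 weights (1 byte per parameter), the published 3.35 TB/s peak HBM3 bandwidth of the H100 SXM, ideal scaling across tensor-parallel GPUs, and no KV cache traffic, so it is an upper bound rather than a prediction.

```python
def decode_ceiling_tok_s(params_billion, bytes_per_param, hbm_tb_s):
    """Upper bound on single-sequence decode speed when weights are read once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_tb_s * 1e12 / weight_bytes

H100_SXM_TB_S = 3.35  # peak HBM3 bandwidth, H100 SXM

print(round(decode_ceiling_tok_s(8, 1, H100_SXM_TB_S)))       # 8B model, FP8, one GPU
print(round(decode_ceiling_tok_s(70, 1, H100_SXM_TB_S * 4)))  # 70B model, FP8, TP=4
```

No engine can exceed this bound for a single sequence, which is why kernel-level differences matter less when batches are small.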

Decision framework

Ask these questions in order. Stop at the first match.

1. Is your hardware non-NVIDIA? → vLLM (or SGLang for AMD). TensorRT-LLM is NVIDIA-only and off the table entirely.

2. Are you running a large MoE model (DeepSeek-R1/V3, Mixtral 8×22B, Qwen-MoE) at scale? → SGLang. Its expert parallelism implementation and RadixAttention give the best throughput/cost ratio for these architectures at production scale.

3. Do you need structured output (JSON schema, constrained grammar) as a primary workload feature? → SGLang. Native constrained decoding at the scheduler level is meaningfully more efficient than vLLM's plugin approach.

4. Is performance your top priority, are you on stable NVIDIA hardware with a fixed model version, and do you have engineering bandwidth for operational complexity? → TensorRT-LLM. The 15–25% throughput advantage on dense models translates directly to cost savings at scale. Once sustained traffic is high enough to keep a multi-GPU fleet busy, that gain pays for the operational overhead in GPU bills.

5. Are you still iterating on models, architecture, or prompting strategy? → vLLM. The fast iteration loop (swap in any HuggingFace model with one flag) is worth far more than raw throughput during development. You can always migrate to TRT-LLM when you've found your model.

6. Default → vLLM. Broadest coverage, lowest ops overhead, still excellent performance. The vast majority of production deployments running today use vLLM.
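The six questions can be encoded as an ordered chain of checks. This is a sketch of the framework above with hypothetical field names; questions 5 and 6 both resolve to vLLM, so they collapse into the final default.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Answers to the questions above; the field names are hypothetical."""
    nvidia_hardware: bool = True
    amd_hardware: bool = False
    large_moe_at_scale: bool = False
    structured_output_primary: bool = False
    max_performance_stable_model: bool = False  # question 4, all conditions met

def choose_engine(w):
    """Walk the questions in order and stop at the first match."""
    if not w.nvidia_hardware:
        return "vLLM (or SGLang on AMD)" if w.amd_hardware else "vLLM"
    if w.large_moe_at_scale:
        return "SGLang"
    if w.structured_output_primary:
        return "SGLang"
    if w.max_performance_stable_model:
        return "TensorRT-LLM"
    return "vLLM"  # questions 5 and 6 both land here

print(choose_engine(Workload(large_moe_at_scale=True, max_performance_stable_model=True)))
# prints: SGLang
```

Because the checks run in order, a large MoE deployment resolves to SGLang even when the TensorRT-LLM conditions also hold.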

Migrating between engines

One practical consideration: migrating from vLLM to TensorRT-LLM mid-stream is not trivial. Expect to invest 1–2 sprints in:

Building the engine CI pipeline (build, cache, version .engine files)
Validating output quality against your evaluation benchmarks (quantization + compilation can shift outputs)
Adapting request handling if you relied on vLLM-specific API features

Plan for this cost in your architecture decision. If your team is small or moving fast, the migration cost alone is often sufficient reason to stay on vLLM until scale demands it.
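A rough break-even estimate helps decide when the migration is worth it. The sketch below uses the Llama-3.1 70B throughput figures from the table (about 2,800 tok/s for vLLM and 3,400 tok/s for TRT-LLM per TP=4 replica); the demand levels, GPU price, and migration cost are hypothetical inputs to replace with your own.

```python
import math

GPUS_PER_REPLICA = 4  # TP=4, as in the Llama-3.1 70B rows of the table

def replicas_needed(demand_tok_s, replica_tok_s):
    """Whole TP=4 replicas required to serve a sustained token rate."""
    return math.ceil(demand_tok_s / replica_tok_s)

def gpus_freed(demand_tok_s, vllm_tok_s=2800, trt_tok_s=3400):
    """GPUs no longer needed after moving from vLLM to TRT-LLM at this demand."""
    diff = replicas_needed(demand_tok_s, vllm_tok_s) - replicas_needed(demand_tok_s, trt_tok_s)
    return diff * GPUS_PER_REPLICA

def breakeven_days(migration_cost, gpus_saved, gpu_cost_per_hour):
    """Days until GPU savings repay a one-off migration cost; None if nothing is saved."""
    if gpus_saved <= 0:
        return None
    return migration_cost / (gpus_saved * gpu_cost_per_hour * 24)

# Hypothetical inputs: $60,000 of engineering time, $3.00 per GPU-hour.
for demand in (2_000, 50_000):
    saved = gpus_freed(demand)
    print(demand, saved, breakeven_days(60_000, saved, 3.00))
```

At 2,000 tok/s a single replica suffices on either engine, so the migration never pays back. At 50,000 tok/s it frees 12 GPUs and breaks even in roughly 70 days under these assumptions.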

Bottom line

The feature gap between vLLM, SGLang, and TensorRT-LLM has largely closed. All three are production-grade. The decision comes down to a single honest question: how much operational complexity can you absorb for how much performance gain?

vLLM is the right answer for 80% of teams. SGLang is the right answer for high-concurrency MoE workloads and structured generation. TensorRT-LLM is the right answer when you have a stable, high-volume NVIDIA deployment and engineering bandwidth to maintain a compiled engine pipeline.

Don't optimize prematurely: benchmark your actual workload rather than relying on synthetic benchmarks, and migrate only when you have profiled evidence that the engine is your bottleneck.
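When you replay your own traffic against an engine, the numbers worth comparing are aggregate throughput and time-to-first-token percentiles. The sketch below assumes you have already recorded per-request timestamps; the `RequestTrace` fields and sample values are hypothetical, and collecting them from a real server is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent: float         # seconds, when the request was sent
    first_token: float  # seconds, when the first token arrived
    done: float         # seconds, when the last token arrived
    output_tokens: int

def percentile(sorted_vals, p):
    """Nearest-rank percentile on an already sorted list."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]

def summarize(traces):
    """Aggregate throughput and TTFT percentiles over a replayed workload."""
    wall = max(t.done for t in traces) - min(t.sent for t in traces)
    ttfts = sorted(t.first_token - t.sent for t in traces)
    return {
        "tok_s": sum(t.output_tokens for t in traces) / wall,
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p99": percentile(ttfts, 99),
    }

traces = [
    RequestTrace(0.0, 0.2, 2.0, 100),
    RequestTrace(0.5, 0.6, 3.0, 200),
    RequestTrace(1.0, 1.5, 4.0, 100),
]
print(summarize(traces))
```

Running the same trace file against each candidate engine, on the hardware you will deploy on, gives a like-for-like comparison.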

Key Takeaway

Start with vLLM — it is the safest default with the broadest coverage and the lowest operational overhead. Move to SGLang for large MoE or high-concurrency structured generation. Move to TensorRT-LLM when you have a stable model, NVIDIA-only hardware, and profiled evidence that the 15–25% throughput gain justifies the engine build overhead. Never make this decision based on benchmarks run on different hardware or different workloads than yours.

Frequently asked questions

Is vLLM or SGLang faster?

It depends on the workload. In our H100 benchmarks across 5,980 requests, SGLang edges ahead on high-concurrency throughput for some workloads while vLLM is competitive and broader in model support. For raw single-engine peak performance on NVIDIA hardware, TensorRT-LLM usually wins but is harder to set up.

Should I use TensorRT-LLM for production?

Use TensorRT-LLM when you need maximum performance on NVIDIA GPUs and can invest in engine compilation and tuning. For faster iteration, broad model coverage, and easier ops, vLLM or SGLang are usually the better starting point.

Do all three support continuous batching?

Yes. vLLM, SGLang, and TensorRT-LLM all support continuous (in-flight) batching, paged KV cache, quantization, speculative decoding, and prefix caching out of the box. The differences are in performance, hardware breadth, and developer experience.

Which inference engine has the best hardware support?

vLLM has the broadest support (NVIDIA GPUs, AMD, TPU, and more). SGLang supports NVIDIA and AMD. TensorRT-LLM is NVIDIA-only but extracts the most performance from that hardware.
