Benchmark Comparison

Data as of 2025-12-05

Compare vLLM and SGLang performance on DeepSeek-R1-Distill-Llama-8B across workloads and concurrency levels. Results collected June 2026 on 2× H100 SXM with TP=2 (tensor parallelism across both GPUs).

Source: vllm-vs-sglang-performance-benchmark (2× H100 SXM, TP=2, CUDA 12.9, 5,980 total requests)

How to read these numbers

TPS: tokens per second, summed across all concurrent requests. Higher is better. Shows how much work the hardware delivers. Look here for cost-per-token decisions.

TTFT: time to first token, in milliseconds. Lower is better. Drives interactive UX. Anything <500 ms feels instant; >2 s feels broken.

p50 latency: median request latency. What a typical user experiences. Lower is better.

p99 latency: 99th-percentile request latency (the long tail). What your worst 1% of users experience. Lower is better. SLOs live here.
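To make the four metrics concrete, here is a minimal sketch of how they could be derived from per-request timings. The `RequestRecord` fields and the `summarize` helper are hypothetical names for illustration, not part of the benchmark harness. The sketch assumes TPS counts output tokens only, reports the median TTFT, and uses nearest-rank percentiles; the benchmark behind this page may aggregate differently.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request timing record (field names are illustrative)."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    output_tokens: int    # number of generated tokens


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate TPS, TTFT, p50 and p99 latency over one benchmark run."""
    # Wall-clock span of the whole run, so concurrent requests are not double counted.
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    latencies_s = [r.end_s - r.start_s for r in records]
    ttfts_ms = [(r.first_token_s - r.start_s) * 1000 for r in records]
    return {
        # Assumption: throughput counts output tokens only.
        "tps": sum(r.output_tokens for r in records) / wall_s,
        # Assumption: TTFT is reported as the median across requests.
        "ttft_p50_ms": percentile(ttfts_ms, 50),
        "p50_latency_s": percentile(latencies_s, 50),
        "p99_latency_s": percentile(latencies_s, 99),
    }


if __name__ == "__main__":
    demo = [
        RequestRecord(0.0, 0.1, 1.0, 100),
        RequestRecord(0.0, 0.2, 2.0, 100),
        RequestRecord(0.0, 0.3, 4.0, 200),
    ]
    print(summarize(demo))
```

Dividing total tokens by the wall-clock span, rather than averaging per-request rates, is what makes TPS a whole-system throughput figure. With only a handful of requests, p99 collapses to the slowest request, so tail numbers are meaningful only at larger sample sizes.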

Workload note: results vary with prompt length, output length, and concurrency, so compare engines only within the same workload. New to inference engines? Read Chapter 4: Software for context on vLLM, SGLang, and TensorRT-LLM. Need to size your own deployment? Try the VRAM Calculator.

18 results

Tokens / Second