Benchmark Comparison
Data as of 2025-12-05
Compare vLLM and SGLang performance on DeepSeek-R1-Distill-Llama-8B across workloads and concurrency levels. Results collected June 2026 on 2× H100 SXM with TP=2.
Source: vllm-vs-sglang-performance-benchmark — 2x H100 SXM, TP=2, CUDA 12.9, 5,980 total requests
How to read these numbers
TPS — tokens per second across all concurrent requests. Higher is better. Measures hardware utilization. Look here for cost-per-token decisions.
TTFT — time to first token, in milliseconds. Lower is better. Drives interactive UX. Anything <500 ms feels instant; >2 s feels broken.
p50 latency — median request latency. What a typical user experiences. Lower is better.
p99 latency — 99th-percentile request latency (the long tail). What your worst 1% of users experience. Lower is better. SLOs live here.
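As a rough illustration of how the four metrics above relate to raw measurements, here is a minimal Python sketch that derives them from per-request timestamps. It is not the benchmark's actual harness: the `RequestRecord` fields, the `summarize` helper, and the nearest-rank percentile method are all assumptions made for this example, and TPS here counts output tokens only, which the source does not specify.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request measurement; field names are illustrative."""
    start_s: float        # time the request was sent
    first_token_s: float  # time the first output token arrived
    end_s: float          # time the last output token arrived
    output_tokens: int    # number of generated tokens


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate TPS, TTFT, p50 latency, and p99 latency for one run."""
    # Wall-clock span of the whole run, so concurrent requests overlap.
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    latencies_ms = [(r.end_s - r.start_s) * 1000 for r in records]
    ttfts_ms = [(r.first_token_s - r.start_s) * 1000 for r in records]
    return {
        "tps": total_tokens / wall_s,
        "ttft_p50_ms": percentile(ttfts_ms, 50),
        "latency_p50_ms": percentile(latencies_ms, 50),
        "latency_p99_ms": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Synthetic records for demonstration only, not benchmark data.
    demo = [
        RequestRecord(0.0, 0.25, 1.0, 100),
        RequestRecord(0.0, 0.5, 2.0, 100),
        RequestRecord(0.0, 0.5, 3.0, 100),
        RequestRecord(0.0, 0.75, 4.0, 100),
    ]
    print(summarize(demo))
```

Dividing total tokens by the wall-clock span of the run, rather than summing per-request rates, is what makes TPS an aggregate figure across concurrent requests. With only a few samples, p99 under nearest-rank collapses to the slowest request, so tail figures are only meaningful with a large request count.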
Workload note: results vary by prompt length, output length, and concurrency, so compare engines only within the same workload.
New to inference engines? Read Chapter 4: Software for context on vLLM, SGLang, and TensorRT-LLM. Need to size your own deployment? Try the VRAM Calculator.
18 results
Tokens / Second