Metrics

Throughput

Total tokens processed (input + output) per second across all concurrent requests — a key measure of serving efficiency and cost.

Definition

Throughput in LLM serving is the total rate at which a server processes tokens, counting both input and output tokens and aggregating across all requests served in parallel. High throughput indicates efficient hardware utilisation: the GPU spends its time processing tokens rather than sitting idle waiting for requests. Techniques such as continuous batching and PagedAttention aim to maximise it. Throughput and latency are often at odds: batching more requests together improves throughput but can increase per-request latency, so production systems aim for a point on the throughput-latency Pareto frontier chosen to satisfy their service-level objectives (SLOs).
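As a rough illustration of the definition, the sketch below computes aggregate throughput from a handful of per-request records. The `RequestRecord` type, its field names, and the sample numbers are hypothetical, invented for this example rather than taken from any particular serving framework. It also assumes throughput is measured over the wall-clock window that spans all requests, which is one common convention among several.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request log entry (field names are assumptions)."""
    start_s: float       # wall-clock time the request started, in seconds
    end_s: float         # wall-clock time the last token was produced
    input_tokens: int    # prompt tokens processed
    output_tokens: int   # tokens generated


def aggregate_throughput(records: list[RequestRecord]) -> float:
    """Total tokens (input + output) per second across all requests.

    The denominator is the wall-clock window from the earliest start to
    the latest end, so overlapping requests share the same seconds.
    """
    if not records:
        return 0.0
    window_start = min(r.start_s for r in records)
    window_end = max(r.end_s for r in records)
    elapsed = window_end - window_start
    if elapsed <= 0:
        raise ValueError("measurement window must have positive duration")
    total_tokens = sum(r.input_tokens + r.output_tokens for r in records)
    return total_tokens / elapsed


def mean_latency(records: list[RequestRecord]) -> float:
    """Average end-to-end latency per request, in seconds."""
    if not records:
        return 0.0
    return sum(r.end_s - r.start_s for r in records) / len(records)


# Three overlapping requests served concurrently (made-up numbers).
sample = [
    RequestRecord(start_s=0.0, end_s=2.0, input_tokens=100, output_tokens=100),
    RequestRecord(start_s=0.5, end_s=2.5, input_tokens=50, output_tokens=150),
    RequestRecord(start_s=1.0, end_s=4.0, input_tokens=200, output_tokens=200),
]

# 800 tokens over a 4-second window
print(f"{aggregate_throughput(sample):.1f} tokens/s")  # prints "200.0 tokens/s"
print(f"{mean_latency(sample):.2f} s mean latency")
```

Reporting both numbers side by side makes the trade-off visible: a configuration that packs more requests into each batch would typically raise the first figure while also raising the second.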
