Metrics

Throughput

Total tokens processed (input + output) per second across all concurrent requests — a key measure of serving efficiency and cost.

Definition

Throughput in LLM serving is the server's total token processing rate, measured in tokens per second and summed over all requests running in parallel. High throughput implies efficient hardware utilisation: the GPU is kept busy processing tokens rather than sitting idle waiting for requests. Techniques such as continuous batching and PagedAttention exist largely to raise throughput. Throughput and latency are often at odds: batching more requests improves throughput but can increase per-request latency. Production systems therefore pick an operating point on that trade-off curve, typically the highest throughput that still meets the latency targets in their SLOs.
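As a rough illustration of the definition, the sketch below computes aggregate throughput from per-request records. The `RequestRecord` type, its field names, and the sample numbers are hypothetical and chosen only for this example. It assumes throughput is measured over the wall-clock window from the earliest start to the latest finish.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request log entry (names are illustrative)."""
    start_s: float       # wall-clock time the request started
    end_s: float         # wall-clock time the last token was produced
    input_tokens: int    # prompt tokens processed during prefill
    output_tokens: int   # tokens generated during decode


def aggregate_throughput(records):
    """Total tokens (input + output) per second across all requests."""
    if not records:
        return 0.0
    # Overlapping requests share the same wall-clock window, so divide by
    # the window length rather than by the sum of per-request durations.
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    if window_s <= 0:
        raise ValueError("measurement window must be positive")
    total_tokens = sum(r.input_tokens + r.output_tokens for r in records)
    return total_tokens / window_s


# Three overlapping requests served concurrently (made-up numbers).
records = [
    RequestRecord(0.0, 2.0, 100, 50),
    RequestRecord(0.5, 3.0, 200, 100),
    RequestRecord(1.0, 4.0, 50, 100),
]
tokens_per_second = aggregate_throughput(records)  # 600 tokens / 4.0 s
```

Dividing by the shared window is what makes this an aggregate figure: the same three requests run one after another would take longer and report a lower number.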

Related terms: Continuous Batching, TPS / TPOT, SLO

More Metrics terms

Arithmetic Intensity

The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.
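The ratio can be worked through for a single linear layer. The sketch below is a simplified estimate that assumes 16-bit weights and activations and counts each tensor as read or written exactly once; the function name, the layer sizes, and the hardware figures are hypothetical and do not describe any particular GPU.

```python
def matmul_arithmetic_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a (batch x d_in) @ (d_in x d_out) matmul.

    Simplifying assumption: weights, inputs, and outputs each cross the
    memory bus exactly once.
    """
    flops = 2 * batch * d_in * d_out  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12
RIDGE_POINT = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs per byte needed to saturate compute

decode_ai = matmul_arithmetic_intensity(batch=1, d_in=4096, d_out=4096)
batched_ai = matmul_arithmetic_intensity(batch=256, d_in=4096, d_out=4096)

decode_is_memory_bound = decode_ai < RIDGE_POINT     # about 1 FLOP/byte
batched_is_compute_bound = batched_ai > RIDGE_POINT  # about 228 FLOPs/byte
```

Under these assumptions a batch of one sits far below the ridge point and is limited by memory bandwidth, while a batch of 256 reuses each weight enough times to become compute-bound.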

Roofline Model

Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.
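The two ceilings combine into one formula: attainable FLOP/s is the smaller of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch follows, using hypothetical hardware figures (100 TFLOP/s and 1 TB/s) that are assumptions for illustration only.

```python
def attainable_flops(arithmetic_intensity, peak_flops, mem_bandwidth):
    """Roofline bound: min(compute ceiling, bandwidth ceiling)."""
    # The bandwidth ceiling rises linearly with arithmetic intensity until
    # it meets the flat compute ceiling at the ridge point.
    return min(peak_flops, arithmetic_intensity * mem_bandwidth)


# Hypothetical accelerator, not a real product.
PEAK = 100e12       # FLOP/s
BANDWIDTH = 1e12    # bytes/s

ridge = PEAK / BANDWIDTH                          # 100 FLOPs/byte
low = attainable_flops(1.0, PEAK, BANDWIDTH)      # memory-bound region
high = attainable_flops(400.0, PEAK, BANDWIDTH)   # compute-bound region
```

Plotting `attainable_flops` against arithmetic intensity on log-log axes gives the familiar roofline shape: a sloped line up to the ridge point, then a flat line.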

TTFT (Time to First Token)

Latency from sending the request to receiving the first generated token — primarily determined by prefill duration and queuing time.
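In practice TTFT is read off client-side timestamps. The sketch below derives TTFT, together with the related time per output token (TPOT), from three timestamps. The function name and arguments are hypothetical, and the TPOT formula (decode time divided by the number of tokens after the first) is one common convention rather than the only one.

```python
import math


def latency_metrics(sent_s, first_token_s, last_token_s, output_tokens):
    """Return (ttft_s, tpot_s) from client-side timestamps in seconds."""
    if output_tokens < 1:
        raise ValueError("need at least one generated token")
    # TTFT covers queuing plus prefill: everything before the first token.
    ttft_s = first_token_s - sent_s
    # TPOT averages the gaps between the remaining tokens (assumed convention).
    if output_tokens == 1:
        tpot_s = 0.0
    else:
        tpot_s = (last_token_s - first_token_s) / (output_tokens - 1)
    return ttft_s, tpot_s


# Made-up request: first token after 0.5 s, 41 tokens finished at 2.5 s.
ttft_s, tpot_s = latency_metrics(
    sent_s=0.0, first_token_s=0.5, last_token_s=2.5, output_tokens=41
)
```

Here TTFT is 0.5 s and TPOT is 0.05 s, so the client waits half a second for the first token and then sees roughly 20 tokens per second.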
