Latency
End-to-end response time from a client's perspective, encompassing network, queuing, prefill, and decode phases.
Definition
Latency is the wall-clock time a client waits from sending a request to receiving the complete response. It decomposes into network round-trip time, scheduling queue wait, prefill time, and decode time (approximately output length × TPOT). The components up to and including prefill largely determine TTFT. Streaming interfaces expose TTFT and TPOT as separate user-observable metrics, while batch APIs report only total latency. In production, tail latency (p95/p99) is often more important than median latency, because at high request volume even a small fraction of slow requests reaches many users and degrades their experience. Scheduling policies, speculative decoding, and pre-warming are among the techniques used to reduce tail latency.
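The decomposition and the median-versus-tail distinction can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the `RequestTiming` record, its field names, and the `percentile` helper are hypothetical, the decode term uses the approximation output length × TPOT from the definition above, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timing record; all durations in seconds."""
    network_s: float     # network round-trip time
    queue_s: float       # scheduling queue wait
    prefill_s: float     # prompt processing (prefill)
    tpot_s: float        # mean time per output token during decode
    output_tokens: int   # number of generated tokens

    @property
    def ttft_s(self) -> float:
        # Client-observed time to first token: everything before decode.
        return self.network_s + self.queue_s + self.prefill_s

    @property
    def total_s(self) -> float:
        # Approximation from the definition: decode = output length x TPOT.
        return self.ttft_s + self.output_tokens * self.tpot_s


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence, for integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# One request: 50 ms network, 100 ms queue, 250 ms prefill, 20 ms/token, 100 tokens.
req = RequestTiming(0.05, 0.10, 0.25, 0.02, 100)
print(f"TTFT  = {req.ttft_s:.2f} s")
print(f"total = {req.total_s:.2f} s")

# Made-up sample: 98 fast requests and 2 slow ones.
latencies = [1.0] * 98 + [10.0] * 2
print("p50 =", percentile(latencies, 50))
print("p99 =", percentile(latencies, 99))
```

In the made-up sample, the median stays at the fast value while p99 lands on a slow request, which is why the two are tracked separately.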
Related
More Metrics terms
Arithmetic Intensity
The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.
Roofline Model
Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.
TTFT (Time to First Token)
Latency from sending the request to receiving the first generated token, primarily determined by queuing time and prefill duration.