Skip to content
Metrics

Latency

End-to-end response time from a client's perspective, encompassing network, queuing, prefill, and decode phases.

Definition

Latency is the wall-clock time a client waits from sending a request to receiving the complete response. It decomposes into: network round-trip, scheduling queue wait, TTFT (prefill), and decode time (output-length × TPOT). Streaming interfaces expose TTFT and TPOT as separate user-observable metrics, while batch APIs report total latency. Tail latency (p95/p99) is often more important than median latency in production because a single slow request degrades the user experience; techniques like scheduling policies, speculative decoding, and pre-warming reduce tail latency.

More Metrics terms