Metrics

Latency

End-to-end response time from a client's perspective, encompassing network, queuing, prefill, and decode phases.

Definition

Latency is the wall-clock time a client waits from sending a request to receiving the complete response. It decomposes into network round-trip time, scheduling queue wait, prefill time, and decode time (approximately the number of output tokens × TPOT). TTFT, as the client observes it, covers everything up to the first token: network, queue wait, and prefill. Streaming interfaces expose TTFT and TPOT as separate user-observable metrics, while batch APIs report only total latency. In production, tail latency (p95/p99) is often more important than median latency: at high request volume, even 1% of requests being slow affects many users, and a user who issues many requests is likely to hit the tail at some point. Scheduling policies, speculative decoding, and pre-warming are among the techniques used to reduce tail latency.
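The decomposition can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the `RequestTiming` class, its field names, and the nearest-rank `percentile` helper are hypothetical, and the sketch assumes the first output token is produced by prefill, so decode accounts for the remaining `output_tokens - 1` tokens.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Per-request timing components, all in seconds (hypothetical names)."""
    network_s: float    # network round-trip time
    queue_s: float      # scheduling queue wait
    prefill_s: float    # prompt processing, ending when the first token exists
    tpot_s: float       # average time per output token during decode
    output_tokens: int  # number of generated tokens

    def ttft(self) -> float:
        # Client-observed time to first token: everything before decode starts.
        return self.network_s + self.queue_s + self.prefill_s

    def total(self) -> float:
        # Assumption: the first token comes out of prefill, so decode
        # contributes one TPOT for each remaining token.
        decode_s = max(self.output_tokens - 1, 0) * self.tpot_s
        return self.ttft() + decode_s


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, a request with 50 ms of network time, 100 ms in the queue, 300 ms of prefill, and 101 output tokens at 20 ms each has a TTFT of 0.45 s and a total latency of about 2.45 s. Calling `percentile` on a list of such totals with `p=95` or `p=99` gives the tail figures discussed above.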

Related terms: TTFT, TPS / TPOT, SLO

More Metrics terms

Arithmetic Intensity

The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.
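As a rough illustration, the ratio can be computed for a dense matrix multiplication. The function name and the traffic model are assumptions made for this sketch: it counts 2·m·k·n FLOPs and assumes each of the three matrices crosses the memory bus exactly once, which ignores caching and tiling effects.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes every matrix is read or written exactly once
    (an idealized traffic model, not a measured value).
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved
```

Under these assumptions, with 2-byte elements and a 4096 × 4096 weight matrix, a single-row input (`m=1`, as in decoding one token) gives just under 1 FLOP per byte, while a 512-row input (as in prefill) gives about 410 FLOPs per byte.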

Roofline Model

Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.
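The two ceilings reduce to a single `min` expression. The sketch below uses hypothetical function names and made-up hardware numbers; real devices rarely reach either ceiling exactly.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound in FLOP/s.

    intensity:     arithmetic intensity in FLOPs per byte
    peak_flops:    compute ceiling in FLOP/s
    mem_bandwidth: memory bandwidth ceiling in bytes/s
    """
    return min(peak_flops, intensity * mem_bandwidth)


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity at which the two ceilings meet."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12
```

With these illustrative numbers the ridge point is 100 FLOPs per byte. A workload at 2 FLOPs per byte is memory-bandwidth-bound and capped at 2 TFLOP/s, while one at 500 FLOPs per byte is compute-bound and capped at the 100 TFLOP/s peak.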

TTFT (Time to First Token)

Latency from sending the request to receiving the first generated token — primarily determined by prefill duration and queuing time.
