Metrics

SLO (Service Level Objective)

A target performance threshold (e.g., p95 TTFT < 500 ms, TPS > 30) that a production LLM system must meet to satisfy quality-of-service requirements.

Definition

An SLO (Service Level Objective) defines the measurable performance targets a production system commits to, commonly expressed as percentile latency bounds (e.g., p99 TTFT < 1 s) or minimum throughput floors (e.g., TPS > 20 per client). In LLM serving, SLOs constrain the operating point on the latency-throughput Pareto curve: a tighter TTFT SLO limits how aggressively requests can be batched, because larger batches raise throughput at the cost of longer queuing and prefill time per request. Capacity planning means provisioning enough GPU headroom that traffic spikes do not violate SLOs. SLOs also inform autoscaling triggers and hardware selection.
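As a rough illustration of how such targets might be checked against measurements, the sketch below computes a nearest-rank percentile over TTFT samples and compares it, along with the lowest observed per-client TPS, to the example thresholds given above (p95 TTFT < 500 ms, TPS > 30). The function names, the dictionary layout, and the choice of the nearest-rank method are assumptions made for this example, not part of any particular serving stack.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # q * n / 100 keeps the arithmetic exact when q and n are integers.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def check_slo(ttft_ms, tps_per_client, ttft_p95_limit_ms=500.0, min_tps=30.0):
    """Compare measured TTFT and TPS against hypothetical SLO thresholds."""
    p95 = percentile(ttft_ms, 95)
    worst_tps = min(tps_per_client)
    return {
        "p95_ttft_ms": p95,
        "min_tps": worst_tps,
        "ok": p95 < ttft_p95_limit_ms and worst_tps > min_tps,
    }

# Hypothetical measurements: one slow request out of twenty does not move p95.
report = check_slo([100] * 19 + [900], [42.0, 38.5, 35.1])
print(report["ok"])
```

A percentile target tolerates a small fraction of slow requests, which is why a single outlier above passes while two outliers in twenty would not.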

Related: TTFT, Latency, Throughput, Chapter 7: Production

More Metrics terms

Arithmetic Intensity

The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.

Roofline Model

Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.
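The two entries above can be tied together with a small calculation. The sketch below computes arithmetic intensity and the roofline bound, taking attainable FLOP/s as the minimum of the compute ceiling and intensity times memory bandwidth. The hardware figures (100 TFLOP/s peak, 1 TB/s bandwidth) are made-up round numbers for illustration, and the function names are assumptions of this example.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_flops, intensity * mem_bandwidth)

def classify(intensity, peak_flops, mem_bandwidth):
    """Compare intensity to the ridge point where the two ceilings meet."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

# Hypothetical operation: 2e9 FLOPs over 1e9 bytes of traffic.
low = arithmetic_intensity(2e9, 1e9)
print(low, attainable_flops(low, PEAK, BANDWIDTH), classify(low, PEAK, BANDWIDTH))
```

With these assumed numbers the ridge point sits at 100 FLOPs per byte, so an operation at 2 FLOPs per byte is limited by memory bandwidth to 2 TFLOP/s, far below the compute ceiling.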

TTFT (Time to First Token)

Latency from sending the request to receiving the first generated token, determined primarily by prefill duration and queuing time.
