Metrics

Roofline Model

Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.

Definition

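The roofline model plots an operation's attainable floating-point throughput (FLOP/s) against its arithmetic intensity (FLOPs per byte of memory traffic) on a log-log chart. Two ceilings bound performance. Below the ridge point, the bound is memory bandwidth multiplied by arithmetic intensity, so throughput rises linearly with intensity (the memory-bound regime); above it, throughput is capped by peak compute FLOP/s (the compute-bound regime). The ridge point is the intensity at which the two ceilings meet, equal to peak FLOP/s divided by memory bandwidth. An operation's attainable performance is the minimum of the two ceilings, and its measured performance lies at or below that bound. By measuring where a kernel sits relative to the roofline, engineers can decide whether to invest in reducing memory traffic (e.g., kernel fusion, quantization) or in increasing compute utilisation (e.g., better tiling).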
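The two ceilings can be sketched in a few lines of Python. This is a minimal illustration, not a profiling tool: the function names and the hardware numbers (100 TFLOP/s peak compute, 1 TB/s memory bandwidth) are assumptions chosen to make the arithmetic easy to follow and do not describe any particular device.

```python
def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_flops / mem_bandwidth


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline upper bound (FLOP/s) for a given arithmetic intensity."""
    # Memory ceiling grows linearly with intensity; compute ceiling is flat.
    return min(peak_flops, mem_bandwidth * intensity)


def regime(intensity, peak_flops, mem_bandwidth):
    """Classify an operation by which ceiling limits it."""
    if intensity < ridge_point(peak_flops, mem_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical accelerator (assumed values, for illustration only).
PEAK = 100e12  # peak compute, FLOP/s
BW = 1e12      # memory bandwidth, bytes/s

# Hypothetical kernels with assumed arithmetic intensities (FLOP/byte).
for name, ai in [("low-intensity kernel", 2.0), ("high-intensity kernel", 400.0)]:
    bound = attainable_flops(ai, PEAK, BW)
    print(f"{name}: AI={ai} FLOP/B, bound={bound:.1e} FLOP/s, {regime(ai, PEAK, BW)}")
```

With these assumed numbers the ridge point is 100 FLOP/byte, so the first kernel is limited by the memory ceiling at 2 TFLOP/s and the second by the compute ceiling at 100 TFLOP/s.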

See also: Arithmetic Intensity, FLOPS, Memory Bandwidth, Chapter 3: Hardware

More Metrics terms

Arithmetic Intensity

The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.

TTFT (Time to First Token)

Latency from sending the request to receiving the first generated token, primarily determined by queuing time and prefill duration.

TPS / TPOT (Tokens per Second / Time per Output Token)

Output-rate metrics: TPS measures tokens generated per second; TPOT measures the average time between successive output tokens, usually reported in milliseconds.
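The latency metrics listed above can be derived from per-token arrival timestamps. The sketch below is one possible convention, not a standard: the function name `latency_metrics` and the sample timestamps are hypothetical, and it assumes TPOT is averaged over the tokens after the first, with TPS taken as its reciprocal. Some tools instead include the first token or the whole request duration in TPS.

```python
def latency_metrics(request_sent, token_times):
    """Compute TTFT, TPOT and TPS from timestamps in seconds.

    request_sent: time the request was sent.
    token_times:  arrival time of each output token, in order.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens to compute TPOT")

    # TTFT: request sent -> first token received (queuing + prefill).
    ttft = token_times[0] - request_sent

    # TPOT: average gap between successive tokens after the first.
    decode_time = token_times[-1] - token_times[0]
    tpot = decode_time / (len(token_times) - 1)

    return {
        "ttft_s": ttft,
        "tpot_ms": tpot * 1000.0,
        "tps": 1.0 / tpot,  # decode-phase tokens per second
    }


# Hypothetical trace: request at t=0, five tokens arriving 0.25 s apart.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

For this assumed trace, TTFT is 0.5 s, TPOT is 250 ms and the decode-phase rate is 4 tokens per second.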
