54 terms

Inference Engineering Glossary

Every key concept in LLM inference — from GPU memory hierarchy and attention variants to quantization algorithms, serving frameworks, and production SLOs — defined clearly and cross-linked to the interactive chapters.

Architecture

20 terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.

Multi-Query Attention (MQA)

Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.
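
To make the MHA / GQA / MQA trade-off concrete, the sketch below estimates KV cache size from the number of K/V heads. The model shape (32 layers, 32 query heads, head dimension 128, FP16 cache, 4096-token sequence) is a hypothetical example, not a specific model, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys plus values across all layers."""
    # The leading factor 2 covers the separate K and V tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model shape, chosen only for illustration.
LAYERS, Q_HEADS, HEAD_DIM, SEQ = 32, 32, 128, 4096

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ)  # one K/V head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ)        # 8 K/V heads, each shared by 4 query heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ)        # a single shared K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.4f} GiB per sequence")
```

Under these assumptions the cache is 2 GiB per sequence for MHA, 0.5 GiB for GQA, and 0.0625 GiB for MQA: the size scales linearly with the number of K/V heads.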

Mixture of Experts (MoE)

Architecture where each token is routed to a sparse subset of specialist feed-forward layers, enabling large parameter counts at low active-parameter cost.

Context Window

The maximum number of tokens a model can attend to in a single forward pass, encompassing both the prompt and generated output.

Embedding

Dense vector representation of a token in a high-dimensional space, learned during training to encode semantic and syntactic relationships.

Tokenization

The process of splitting text into discrete tokens drawn from the model's fixed vocabulary, typically via byte-pair encoding (BPE) or similar algorithms.

Logits

Raw, unnormalised scores over the vocabulary produced by the model's final linear layer before softmax is applied.

Disaggregated Inference

Architecture separating prefill and decode work onto different GPU pools, allowing each phase to be independently scaled and optimised.

Tensor Parallelism

Distributing individual weight matrices across multiple GPUs so each GPU computes a column or row shard, requiring an all-reduce after each attention and MLP block.

Pipeline Parallelism

Strategy, common in multi-node deployments, that assigns consecutive groups of Transformer layers to different GPUs or nodes, passing activations between stages over the interconnect.

Expert Parallelism

Parallelism strategy for MoE models that places different expert FFN modules on different GPUs, routing tokens via all-to-all communication.

Prefill Phase

The initial forward pass that processes the full input prompt in parallel, producing the first output token and populating the KV cache.

Decode Phase

The iterative token-by-token generation phase that follows prefill, where each step extends the KV cache by one row and is memory-bandwidth-bound.

Sampling

Stochastic token selection from the model's output probability distribution, as opposed to greedy (argmax) decoding.

Temperature

A scalar applied to logits before softmax that controls output randomness: values <1 sharpen the distribution, values >1 flatten it.
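
A minimal sketch of temperature scaling, assuming the common convention of dividing logits by the temperature before softmax. The function name and the example logits are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores over a 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token rises as the temperature drops below 1 and falls as it rises above 1, which is the sharpening and flattening described above.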

Beam Search

Deterministic decoding that maintains the top-K highest-probability partial sequences at each step, used in translation but rarely in modern LLM chat.

Activations

Intermediate tensor values computed during a model's forward pass, held in VRAM transiently and discarded after each decode step.

Model Weights

The trained parameter tensors of an LLM (embeddings, attention projections, MLP layers) loaded into VRAM at startup and kept resident throughout serving.

Hardware

6 terms

HBM (High Bandwidth Memory)

3D-stacked DRAM technology used in data-centre GPUs, offering several times the memory bandwidth of GDDR at a higher cost per gigabyte.

VRAM

Video RAM: the GPU's dedicated device memory (on-package HBM on data-centre GPUs) holding model weights, KV cache, and activations during inference.

Memory Bandwidth

The rate at which data can be read from or written to GPU memory, measured in TB/s — the primary bottleneck during autoregressive LLM decoding.
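
A rough way to see why bandwidth bounds decoding: at batch size 1, every decode step has to stream all model weights from memory at least once. The sketch below computes that lower bound. The figures (7 billion parameters, 2 bytes per parameter, 3.35 TB/s) are assumed illustrative inputs, and the estimate ignores KV cache reads, kernel overheads, and compute time.

```python
def min_decode_step_seconds(weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to read every weight once."""
    return weight_bytes / bandwidth_bytes_per_s

# Assumed illustrative inputs, not measurements.
params = 7e9              # parameter count
bytes_per_param = 2       # FP16 / BF16
bandwidth = 3.35e12       # bytes per second

step = min_decode_step_seconds(params * bytes_per_param, bandwidth)
print(f"lower bound per token: {step * 1e3:.2f} ms")
print(f"upper bound on single-stream speed: {1 / step:.0f} tokens/s")
```

With these inputs the bound is roughly 4 ms per token. Halving the bytes per parameter (for example through 8-bit quantization) halves the bound, which is why weight quantization speeds up decoding.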

FLOPS

Floating-point operations per second — the peak compute throughput of a GPU, determining how fast compute-bound operations (like prefill) run.

Tensor Core

Specialised matrix-multiply-accumulate units in NVIDIA GPUs that execute fused D=A×B+C on tiles at up to 16× the throughput of CUDA cores.

NVLink

NVIDIA's high-speed GPU-to-GPU interconnect, delivering up to 900 GB/s bidirectional bandwidth on H100 NVLink 4.0 for tensor and pipeline parallelism.

Software

8 terms

PagedAttention

vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
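
The difference from static batching can be seen in a toy simulation that counts decode steps. This is a simplified sketch under stated assumptions: each request is reduced to the number of tokens it still has to generate, every step yields one token per running sequence, and prefill cost is ignored. The function names are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every step."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running sequence emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

output_lengths = [2, 8, 2, 8]  # made-up output lengths in tokens
print(static_batching_steps(output_lengths, batch_size=2))
print(continuous_batching_steps(output_lengths, batch_size=2))
```

For this workload the static scheduler needs 16 steps and the continuous one needs 12, because short requests no longer hold a slot idle while a long neighbour finishes.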

vLLM

Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.

SGLang

LLM inference runtime from the LMSYS organisation, originally developed by researchers at Stanford and UC Berkeley, featuring RadixAttention, speculative execution, and structured generation support.

TensorRT-LLM

NVIDIA's optimised inference library for LLMs, generating highly tuned CUDA kernels via TensorRT with support for FP8, AWQ, and multi-GPU serving.

llama.cpp

CPU-first inference library in C/C++ enabling quantised LLM inference on consumer hardware without a GPU.

Text Generation Inference (TGI)

Hugging Face's production LLM serving toolkit with continuous batching, tensor parallelism, and a Rust-based HTTP server.

Optimization

13 terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
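
As a minimal illustration, the sketch below applies symmetric per-tensor absmax quantization to a short list of weights. This is only one simple INT8 scheme, chosen here as an assumption for clarity. Production methods typically use per-channel or per-group scales and calibration data.

```python
def quantize_int8(values):
    """Symmetric absmax quantization to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Map the stored integers back to approximate real values."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.003, 0.9]  # made-up FP values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, scale, max(errors))
```

Each value is stored in one byte instead of two, and the round-trip error per element stays within half a quantization step (`scale / 2`).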

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

AWQ (Activation-aware Weight Quantization)

Weight-only quantization method that uses activation statistics to identify and protect the roughly 1% of weight channels most salient for output quality, enabling accurate 4-bit inference.

GPTQ

Layer-wise weight quantization using second-order Hessian information to minimise quantization error, supporting 4-bit and 8-bit precisions.

SmoothQuant

W8A8 quantization technique that migrates quantization difficulty from activations to weights via a mathematically equivalent per-channel scaling.

Speculative Decoding

Latency technique where a small draft model proposes token sequences that the target model verifies in parallel, typically cutting decode latency (TPOT) by 2–3× while leaving TTFT largely unchanged.
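
The propose-then-verify loop can be sketched with toy "models" that are plain Python functions mapping a token context to the next token. This is the greedy-decoding variant, assumed here for simplicity. Sampling-based variants use a probabilistic accept/reject rule instead of exact matching. All names are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative step; returns the tokens appended to prefix."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal. A real system performs
    #    these checks in a single batched forward pass.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target contributes one bonus token.
    accepted.append(target_next(ctx))
    return accepted

# Toy models: the target counts upward mod 10; the draft errs after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step([0], draft, target, k=4))
```

In this sketch the output always equals what the target model alone would have generated greedily. The gain is that several tokens can be committed per target forward pass when the draft guesses well.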

Draft Model

Small, fast auxiliary model used in speculative decoding to propose candidate tokens for the larger target model to verify.

EAGLE

Speculative decoding variant with a lightweight head on target-model hidden states, achieving higher draft acceptance rates than token-level draft models.

Medusa

Speculative decoding approach adding multiple prediction heads to the base model, where the k-th head predicts the token k positions ahead, removing the need for a separate draft model.

Prefix Caching

Reusing KV cache blocks computed for a shared prompt prefix across multiple requests, eliminating redundant prefill computation.
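
One common way to implement this idea is to key each fixed-size KV block by a hash of all tokens up to and including that block, so two requests share a block only if their whole prefix matches. The sketch below shows the bookkeeping only, with a placeholder string standing in for the KV tensors. The class, the block size of 4, and the hashing scheme are illustrative assumptions, not any particular framework's implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; kept small for illustration

def block_keys(token_ids):
    """One key per full block, each covering the entire prefix so far."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update((",".join(map(str, block)) + ";").encode())
        keys.append(running.copy().hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # prefix hash -> stand-in for a KV block

    def insert(self, token_ids):
        for key in block_keys(token_ids):
            self.blocks.setdefault(key, "kv-block")

    def match(self, token_ids):
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            hit += BLOCK_SIZE
        return hit

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])      # first request
print(cache.match([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]))
```

The second request shares its first 8 tokens with the first, so only the remaining tokens would need prefill. Partial blocks (tokens 9 and 10 above) are not cached in this sketch.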

RadixAttention

SGLang's generalised prefix caching, which uses a radix tree with LRU eviction to automatically reuse the KV cache of any prefix shared between requests, not just a single predefined system prompt.

Chunked Prefill

Breaking large prefill requests into smaller token chunks so the GPU can interleave prefill and decode work, reducing inter-token latency stalls for in-flight decode requests.

Metrics

7 terms

Arithmetic Intensity

The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.

Roofline Model

Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.
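
The two ceilings reduce to a single `min`. In the sketch below the hardware numbers (1e15 FLOP/s peak compute, 3e12 bytes/s bandwidth) are hypothetical round figures, and the function names are illustrative.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)

def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOPs per byte) where the two ceilings meet."""
    return peak_flops / peak_bandwidth

PEAK_FLOPS = 1e15      # hypothetical accelerator, FLOP/s
PEAK_BW = 3e12         # hypothetical accelerator, bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
for intensity in (2, 100, 1000):
    perf = attainable_flops(intensity, PEAK_FLOPS, PEAK_BW)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:>5} FLOPs/byte -> {perf:.2e} FLOP/s ({regime})")
```

An operation with only a few FLOPs per byte (typical of small-batch decode) sits far left of the ridge and is limited by bandwidth. Large-batch prefill matmuls sit to the right and can approach peak compute.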

TTFT (Time to First Token)

Latency from sending the request to receiving the first generated token — primarily determined by prefill duration and queuing time.

TPS / TPOT (Tokens per Second / Time per Output Token)

Output speed metrics: TPS measures tokens generated per second, while TPOT measures the average time (usually in milliseconds) between successive output tokens of a single request.
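
Given client-side timestamps for one streamed response, these metrics can be computed as below. Definitions vary between tools. This sketch assumes TPOT excludes the first token and TPS is measured over the whole request, and the timestamps are made up.

```python
def latency_metrics(request_sent, token_times):
    """Per-request TTFT, TPOT and TPS from arrival timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft = token_times[0] - request_sent
    # TPOT: mean gap between successive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    # TPS here is end-to-end: all output tokens over total elapsed time.
    tps = n / (token_times[-1] - request_sent)
    return {"ttft_s": ttft, "tpot_s": tpot, "tps": tps}

m = latency_metrics(0.0, [0.50, 0.55, 0.60, 0.65, 0.70])
print(m)
```

In this example TTFT is 0.5 s and TPOT is about 50 ms. The end-to-end TPS (about 7) is much lower than 1 / TPOT (20) because it also counts the time spent waiting for the first token.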

Throughput

Total tokens processed (input + output) per second across all concurrent requests — a key measure of serving efficiency and cost.

Latency

End-to-end response time from a client's perspective, encompassing network, queuing, prefill, and decode phases.

SLO (Service Level Objective)

A target performance threshold (e.g., p95 TTFT < 500 ms, TPS > 30) that a production LLM system must meet to satisfy quality-of-service requirements.
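
Checking a percentile-based SLO such as "p95 TTFT < 500 ms" against measured samples can be sketched as follows. The nearest-rank percentile method and the sample data are assumptions for illustration. Monitoring systems often use interpolated or histogram-based percentiles instead.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def meets_slo(ttft_ms_samples, threshold_ms=500, pct=95):
    """True when the chosen percentile is strictly under the threshold."""
    return percentile(ttft_ms_samples, pct) < threshold_ms

# Made-up measurements: mostly fast, with a slow tail.
ttft_ms = [120] * 90 + [450] * 6 + [900] * 4

print(percentile(ttft_ms, 95), meets_slo(ttft_ms))
print(percentile(ttft_ms, 99), meets_slo(ttft_ms, pct=99))
```

The same traffic passes at p95 but fails at p99, which shows why the percentile is part of the objective and not a detail of how it is measured.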
