Inference Engineering Glossary
Every key concept in LLM inference — from GPU memory hierarchy and attention variants to quantization algorithms, serving frameworks, and production SLOs — defined clearly and cross-linked to the interactive chapters.
Architecture
20 terms
KV Cache
GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.
Multi-Head Attention (MHA)
Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.
Grouped-Query Attention (GQA)
Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.
Multi-Query Attention (MQA)
Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.
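To make the MHA, GQA, and MQA trade-off concrete, the sketch below estimates KV cache size per sequence. The model shape (32 layers, 32 query heads, head dimension 128, 4096-token context, FP16) is a hypothetical example, not a specific model, and the function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    # The leading factor of 2 counts one key tensor plus one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape; only the number of KV heads changes per variant.
for name, kv_heads in [("MHA", 32), ("GQA, 8 KV heads", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(32, kv_heads, 128, 4096) / 2**30
    print(f"{name}: {gib:.4f} GiB per sequence")
```

Under these assumptions the cache shrinks from 2 GiB (MHA) to 0.5 GiB (GQA with 8 KV heads) to 0.0625 GiB (MQA), in direct proportion to the number of KV heads.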
Mixture of Experts (MoE)
Architecture where each token is routed to a sparse subset of specialist feed-forward layers, enabling large parameter counts at low active-parameter cost.
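A minimal sketch of sparse routing, assuming a top-k router that renormalises the selected experts' scores with a softmax (one common design, not the only one). The function name `route_token` and the example logits are illustrative.

```python
import math


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Four experts, two active per token: only half the expert FFNs run.
print(route_token([0.1, 2.0, -1.0, 1.5], top_k=2))
```

The token's output would then be the weighted sum of the chosen experts' FFN outputs; the unchosen experts do no work for this token.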
Context Window
The maximum number of tokens a model can attend to in a single forward pass, encompassing both the prompt and generated output.
Embedding
Dense vector representation of a token in a high-dimensional space, learned during training to encode semantic and syntactic relationships.
Tokenization
The process of splitting text into discrete tokens drawn from the model's fixed vocabulary, typically via byte-pair encoding (BPE) or similar algorithms.
Logits
Raw, unnormalised scores over the vocabulary produced by the model's final linear layer before softmax is applied.
Disaggregated Inference
Architecture separating prefill and decode work onto different GPU pools, allowing each phase to be independently scaled and optimised.
Tensor Parallelism
Distributing individual weight matrices across multiple GPUs so each GPU computes a column or row shard, requiring an all-reduce to combine partial results (typically after each attention block and each MLP block).
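The sharding idea can be simulated on one machine. The sketch below assumes a two-layer MLP whose first matrix is split by columns and second by rows, with the all-reduce modelled as an element-wise sum over simulated GPUs. All names are illustrative and no real communication library is involved.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def reference_mlp(x, w1, w2):
    """Unsharded two-layer MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def tensor_parallel_mlp(x, w1, w2, num_gpus):
    """Same MLP with w1 split by columns and w2 split by rows."""
    width = len(w1[0]) // num_gpus  # hidden units per shard (assumes even split)
    partials = []
    for g in range(num_gpus):
        lo, hi = g * width, (g + 1) * width
        w1_shard = [row[lo:hi] for row in w1]      # column shard of first matrix
        w2_shard = w2[lo:hi]                       # matching row shard of second
        hidden = relu(matmul(x, w1_shard))         # local slice of the hidden layer
        partials.append(matmul(hidden, w2_shard))  # full-shape partial output
    # Simulated all-reduce: element-wise sum of every GPU's partial output.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, -2, 3, 4]]
w1 = [[1, 0, 2, -1], [0, 1, -1, 2], [3, 1, 0, 0], [1, 1, 1, 1]]
w2 = [[1, 2], [0, 1], [-1, 0], [2, 2]]
print(tensor_parallel_mlp(x, w1, w2, num_gpus=2) == reference_mlp(x, w1, w2))
```

Because the activation is element-wise, each shard can apply it locally, and only one reduction is needed at the end of the block.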
Pipeline Parallelism
Strategy, often used across nodes, that assigns consecutive Transformer layer groups to different GPUs or GPU nodes, passing activations between stages over the interconnect or network.
Expert Parallelism
Parallelism strategy for MoE models that places different expert FFN modules on different GPUs, routing tokens via all-to-all communication.
Prefill Phase
The initial forward pass that processes the full input prompt in parallel, producing the first output token and populating the KV cache.
Decode Phase
The iterative token-by-token generation phase that follows prefill, where each step extends the KV cache by one row and is memory-bandwidth-bound.
Sampling
Stochastic token selection from the model's output probability distribution, as opposed to greedy (argmax) decoding.
Temperature
A positive scalar that divides the logits before softmax to control output randomness: values <1 sharpen the distribution, values >1 flatten it.
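How logits, temperature, and sampling fit together can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the example logits are illustrative, and real serving stacks usually add filters such as top-k or top-p on top.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id at random according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the highest logit, approaching greedy decoding; higher temperatures spread it across more tokens.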
Beam Search
Deterministic decoding that maintains the top-K highest-probability partial sequences at each step, used in translation but rarely in modern LLM chat.
Activations
Intermediate tensor values computed during a model's forward pass, held in VRAM transiently and discarded after each decode step.
Model Weights
The trained parameter tensors of an LLM (embeddings, attention projections, MLP layers) loaded into VRAM at startup and kept resident throughout serving.
Hardware
6 terms
HBM (High Bandwidth Memory)
3D-stacked DRAM technology used in data-centre GPUs, offering several times the memory bandwidth of GDDR at a higher cost per gigabyte.
VRAM
Video RAM: the GPU's dedicated device memory (on-package HBM on data-centre GPUs) holding model weights, KV cache, and activations during inference.
Memory Bandwidth
The rate at which data can be read from or written to GPU memory, measured in TB/s — the primary bottleneck during autoregressive LLM decoding.
FLOPS
Floating-point operations per second — the peak compute throughput of a GPU, determining how fast compute-bound operations (like prefill) run.
Tensor Core
Specialised matrix-multiply-accumulate units in NVIDIA GPUs that execute fused D=A×B+C on tiles at up to 16× the throughput of CUDA cores.
NVLink
NVIDIA's high-speed GPU-to-GPU interconnect, delivering up to 900 GB/s bidirectional bandwidth on H100 NVLink 4.0 for tensor and pipeline parallelism.
Software
8 terms
PagedAttention
vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.
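The paging idea can be illustrated with a toy block allocator. This is a sketch of the general concept only, not vLLM's actual implementation or API; the class name `PagedKVAllocator` and its methods are hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator mapping each sequence's tokens to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")
alloc.append_token("b")
print(alloc.block_tables)  # blocks for "a" need not be adjacent in memory
```

Because a sequence only ever wastes the unused tail of its last block, memory is not reserved up front for the maximum possible length.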
FlashAttention
IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
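A toy simulation shows why refilling freed slots helps. It assumes every request needs a known, positive number of decode steps and ignores prefill cost entirely; both function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lens, max_batch):
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())            # admit a new request
        running = [left - 1 for left in running]         # one decode step for all
        running = [left for left in running if left > 0]  # retire finished requests
        steps += 1
    return steps


lens = [2, 8, 2, 8]
print(static_batching_steps(lens, 2), continuous_batching_steps(lens, 2))
```

For this workload the static schedule takes 16 decode steps and the continuous one takes 12, because short requests no longer hold a slot idle while a long neighbour finishes.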
vLLM
Open-source LLM inference and serving library from UC Berkeley featuring PagedAttention, continuous batching, and broad model support.
SGLang
LLM inference runtime from the LMSYS organisation, originally developed by researchers at Stanford and UC Berkeley, featuring RadixAttention, speculative execution, and structured generation support.
TensorRT-LLM
NVIDIA's optimised inference library for LLMs, generating highly tuned CUDA kernels via TensorRT with support for FP8, AWQ, and multi-GPU serving.
llama.cpp
CPU-first inference library in C/C++ enabling quantised LLM inference on consumer hardware without a GPU.
Text Generation Inference (TGI)
Hugging Face's production LLM serving toolkit with continuous batching, tensor parallelism, and a Rust-based HTTP server.
Optimization
13 terms
Quantization
Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.
INT8
8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
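As a rough illustration of what 8-bit weight quantization does, the sketch below applies symmetric per-tensor quantization to a short list of weights. Production methods typically use per-channel or per-group scales and calibration data; the function names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case error
```

Each weight is stored in one byte plus a shared scale, and the round-trip error per weight is bounded by half a quantization step.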
FP8
8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.
AWQ (Activation-aware Weight Quantization)
Weight-only quantization method that uses activation statistics to identify the roughly 1% of weight channels most salient for output quality and protects them via per-channel scaling, enabling accurate 4-bit inference.
GPTQ
Layer-wise weight quantization using second-order Hessian information to minimise quantization error, supporting 4-bit and 8-bit precisions.
SmoothQuant
W8A8 quantization technique that migrates quantization difficulty from activations to weights via a mathematically equivalent per-channel scaling.
Speculative Decoding
Latency technique where a small draft model proposes several tokens that the target model verifies in a single parallel forward pass, typically cutting decode latency (TPOT) by 2–3×; TTFT is largely unaffected.
Draft Model
Small, fast auxiliary model used in speculative decoding to propose candidate tokens for the larger target model to verify.
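The propose-then-verify loop can be sketched for the greedy case. This is a simplified illustration: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and a real system would score all proposed positions in one target forward pass and use rejection sampling when decoding stochastically.

```python
def speculative_step(target_next, draft_next, context, k):
    """One greedy speculative step; returns the tokens to append to context."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # 2. The target checks each proposal (done sequentially here for clarity).
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def target(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx):
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(target, draft, [0], k=4))
```

The output is always what the target alone would have generated greedily; the draft only changes how many target passes are needed to get there.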
EAGLE
Speculative decoding variant with a lightweight head on target-model hidden states, achieving higher draft acceptance rates than token-level draft models.
Medusa
Speculative decoding approach that adds multiple prediction heads to the base model, each predicting a token a different number of positions ahead, so no separate draft model is needed.
Prefix Caching
Reusing KV cache blocks computed for a shared prompt prefix across multiple requests, eliminating redundant prefill computation.
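One way to detect a reusable prefix is to hash the prompt in fixed-size blocks, chaining each block's hash with the one before it. The sketch below assumes that scheme and tracks only which blocks are cached, not the KV tensors themselves. The names `block_hashes` and `PrefixCache` are hypothetical, and Python's built-in `hash` stands in for a collision-resistant hash.

```python
def block_hashes(tokens, block_size):
    """Hash each full block chained with its predecessor's hash."""
    hashes, prev = [], None
    full = len(tokens) - len(tokens) % block_size  # ignore the partial last block
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self, block_size):
        self.block_size = block_size
        self.cached = set()  # hashes of blocks whose KV entries are resident

    def lookup_and_insert(self, tokens):
        """Return how many leading prompt tokens can skip prefill."""
        hits, matching = 0, True
        for h in block_hashes(tokens, self.block_size):
            if matching and h in self.cached:
                hits += 1
            else:
                matching = False  # first miss ends the reusable prefix
                self.cached.add(h)
        return hits * self.block_size


cache = PrefixCache(block_size=4)
system_prompt = list(range(8))
print(cache.lookup_and_insert(system_prompt + [100, 101, 102, 103]))
print(cache.lookup_and_insert(system_prompt + [200, 201, 202, 203]))
```

The first request finds nothing cached; the second reuses the two blocks that cover the shared system prompt and only needs prefill for its own suffix.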
RadixAttention
SGLang's generalised prefix caching, which keeps KV cache entries in a radix tree with LRU eviction so that any prefix shared between requests is reused automatically, not just a single fixed system prompt.
Chunked Prefill
Breaking large prefill requests into smaller token chunks so the GPU can interleave prefill and decode work, reducing inter-token latency stalls for requests that are already decoding.
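A minimal sketch of the scheduling idea, assuming a fixed per-step token budget shared between one long prompt and a constant number of in-flight decode requests (each decode request consumes one token of budget per step). The function name and the numbers are illustrative.

```python
def chunked_prefill_schedule(prompt_len, num_decoding, token_budget):
    """Return a list of (prefill_tokens, decode_tokens) pairs, one per step."""
    if token_budget <= num_decoding:
        raise ValueError("token budget leaves no room for prefill work")
    schedule = []
    remaining = prompt_len
    while remaining > 0:
        # Decode requests are served first; prefill gets what is left over.
        chunk = min(remaining, token_budget - num_decoding)
        schedule.append((chunk, num_decoding))
        remaining -= chunk
    return schedule


for step in chunked_prefill_schedule(prompt_len=1000, num_decoding=8, token_budget=256):
    print(step)
```

With these numbers the 1000-token prompt is processed over five steps, and the eight decoding requests produce a token in every one of them instead of waiting for the whole prompt.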
Metrics
7 terms
Arithmetic Intensity
The ratio of FLOPs to bytes of memory traffic for an operation, used to determine whether a workload is compute-bound or memory-bandwidth-bound.
Roofline Model
Visual performance model that shows achievable FLOP/s as a function of arithmetic intensity, with two ceilings: memory bandwidth and compute.
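The two ceilings reduce to a single `min`. The sketch below uses hypothetical round-number hardware figures (1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth), not the specification of any particular GPU.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline: performance is capped by compute or by bandwidth x intensity."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_flops / peak_bandwidth


PEAK_FLOPS = 1e15      # hypothetical accelerator, FLOP/s
PEAK_BANDWIDTH = 3e12  # hypothetical accelerator, bytes/s

# Batch-1 decode with FP16 weights: about 2 FLOPs per 2-byte weight read.
decode_intensity = 2 / 2
print(ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH))
print(attainable_flops(decode_intensity, PEAK_FLOPS, PEAK_BANDWIDTH))
```

An intensity of about 1 FLOP/byte sits far to the left of the ridge point, which is why batch-1 decoding is memory-bandwidth-bound while large-batch prefill can approach the compute ceiling.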
TTFT (Time to First Token)
Latency from sending the request to receiving the first generated token — primarily determined by prefill duration and queuing time.
TPS / TPOT (Tokens per Second / Time per Output Token)
Output throughput metrics: TPS measures tokens generated per second, TPOT measures milliseconds between successive output tokens.
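These per-request metrics can be derived from token arrival timestamps. The sketch below assumes timestamps in seconds and computes TPOT as the mean gap between successive output tokens; the function name and the dictionary keys are illustrative.

```python
def request_metrics(sent_at, token_times):
    """Compute TTFT, TPOT, per-request TPS, and end-to-end latency."""
    ttft = token_times[0] - sent_at
    n = len(token_times)
    # TPOT averages the gaps after the first token, so it excludes prefill.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    tps = 1.0 / tpot if tpot > 0 else float("inf")
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "tps": tps,
        "e2e_s": token_times[-1] - sent_at,
    }


# Request sent at t=0; five tokens arrive 50 ms apart after a 400 ms wait.
print(request_metrics(0.0, [0.40, 0.45, 0.50, 0.55, 0.60]))
```

Here TTFT is 0.4 s and TPOT is about 0.05 s, which corresponds to roughly 20 tokens per second for this single request.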
Throughput
Total tokens processed (input + output) per second across all concurrent requests — a key measure of serving efficiency and cost.
Latency
End-to-end response time from a client's perspective, encompassing network, queuing, prefill, and decode phases.
SLO (Service Level Objective)
A target performance threshold (e.g., p95 TTFT < 500 ms, TPS > 30) that a production LLM system must meet to satisfy quality-of-service requirements.
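Checking a percentile-based SLO against measured samples might look like the following. The sketch assumes the nearest-rank percentile definition (monitoring systems differ on this) and a hypothetical 500 ms p95 TTFT target; the function names are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_ttft_slo(ttft_ms_samples, p95_target_ms=500):
    """True when the 95th-percentile TTFT is under the target."""
    return percentile(ttft_ms_samples, 95) < p95_target_ms


samples = [100] * 19 + [2000]  # one slow outlier in twenty requests
print(percentile(samples, 95), meets_ttft_slo(samples))
```

A single outlier in twenty requests does not breach a p95 objective under this definition, whereas two would.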