54 terms

Inference Engineering Glossary

Every key concept in LLM inference — from GPU memory hierarchy and attention variants to quantization algorithms, serving frameworks, and production SLOs — defined clearly and cross-linked to the interactive chapters.

Architecture

20 terms

KV Cache

GPU memory buffer storing attention key/value tensors so they need not be recomputed for tokens already processed.

Multi-Head Attention (MHA)

Standard Transformer attention where every layer maintains separate Q, K, V projections for each attention head.

Grouped-Query Attention (GQA)

Attention variant that shares K/V heads across groups of query heads, shrinking KV cache size while retaining most of MHA's expressiveness.

Multi-Query Attention (MQA)

Extreme attention variant using a single shared K/V head for all query heads, minimising KV cache at the cost of some model quality.
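
The effect of sharing K/V heads on cache size can be estimated with simple arithmetic. The sketch below assumes the usual accounting of one K and one V tensor per layer per token; the function name and the model shape (32 layers, 32 query heads, head dimension 128, FP16 cache) are illustrative assumptions, not figures from a specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size: K and V tensors for every layer and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model shape: 32 layers, 32 query heads, head_dim 128, FP16 cache.
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)  # one K/V head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)        # 8 K/V heads, each shared by 4 query heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)        # a single K/V head shared by all query heads

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of K/V heads, which is why GQA and MQA allow larger batches or longer contexts in the same VRAM.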

Mixture of Experts (MoE)

Architecture where each token is routed to a sparse subset of specialist feed-forward layers, enabling large parameter counts at low active-parameter cost.
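
A minimal sketch of the routing step, assuming a softmax router with top-k selection and renormalised gate weights. Real MoE models differ in details such as whether the kept weights are renormalised; `top_k_route` and the example logits are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise so the kept gate weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# One token, four experts: only two expert FFNs would run for this token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(routes)
```

Only the selected experts' feed-forward layers are evaluated, so active parameters per token stay far below the total parameter count.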

Context Window

The maximum number of tokens a model can attend to in a single forward pass, encompassing both the prompt and generated output.

Embedding

Dense vector representation of a token in a high-dimensional space, learned during training to encode semantic and syntactic relationships.

Tokenization

The process of splitting text into discrete tokens that form the model's vocabulary, typically via byte-pair encoding (BPE) or similar algorithms.

Logits

Raw, unnormalised scores over the vocabulary produced by the model's final linear layer before softmax is applied.

Disaggregated Inference

Architecture separating prefill and decode work onto different GPU pools, allowing each phase to be independently scaled and optimised.

Tensor Parallelism

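Distributing individual weight matrices across multiple GPUs so each GPU computes a column or row shard, with an all-reduce needed to combine partial results (in the common Megatron-style layout, once after the attention block and once after the MLP block of each layer).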
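
The column and row sharding described above can be sketched with plain lists standing in for GPU tensors. This is a toy, single-process illustration: the shards are computed in a loop rather than on separate devices, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_parallel(x, w, num_shards):
    """Each shard holds a slice of W's columns; outputs are concatenated."""
    cols = len(w[0]) // num_shards
    outs = []
    for s in range(num_shards):
        w_shard = [row[s * cols:(s + 1) * cols] for row in w]
        outs.append(matmul(x, w_shard))
    # Concatenate per-shard outputs along the column axis (an all-gather).
    return [sum((o[r] for o in outs), []) for r in range(len(x))]

def row_parallel(x, w, num_shards):
    """Each shard holds a slice of W's rows; partial outputs are summed."""
    rows = len(w) // num_shards
    partials = []
    for s in range(num_shards):
        x_shard = [r[s * rows:(s + 1) * rows] for r in x]
        w_shard = w[s * rows:(s + 1) * rows]
        partials.append(matmul(x_shard, w_shard))
    # Element-wise sum of the partial results (the all-reduce step).
    return [[sum(p[r][c] for p in partials) for c in range(len(w[0]))]
            for r in range(len(x))]

x = [[1, 2, 3, 4]]
w = [[1, 0], [0, 1], [2, 2], [1, 3]]
reference = matmul(x, w)
col_result = column_parallel(x, w, num_shards=2)
row_result = row_parallel(x, w, num_shards=2)
```

Both sharded variants reproduce the unsharded product; the row-parallel case is the one that needs the summation that an all-reduce provides.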

Pipeline Parallelism

Parallelism strategy that assigns consecutive groups of Transformer layers to different GPUs (often on different nodes), passing activations from one stage to the next over the interconnect or network.

Expert Parallelism

Parallelism strategy for MoE models that places different expert FFN modules on different GPUs, routing tokens via all-to-all communication.

Prefill Phase

The initial forward pass that processes the full input prompt in parallel, producing the first output token and populating the KV cache.

Decode Phase

The iterative token-by-token generation phase that follows prefill, where each step extends the KV cache by one row and is memory-bandwidth-bound.

Sampling

Stochastic token selection from the model's output probability distribution, as opposed to greedy (argmax) decoding.

Temperature

A positive scalar that divides the logits before softmax to control output randomness: values <1 sharpen the distribution, values >1 flatten it, and 1 leaves it unchanged.
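
A minimal sketch of temperature scaling followed by sampling, assuming the standard formulation in which logits are divided by the temperature before softmax. The function name, the example logits, and the fixed random seed are illustrative.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more mass on the top token
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform

# Stochastic sampling draws a token index in proportion to its probability.
rng = random.Random(0)
token = rng.choices(range(len(base)), weights=base, k=1)[0]
```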

Beam Search

Deterministic decoding that maintains the top-K highest-probability partial sequences at each step, used in translation but rarely in modern LLM chat.
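
A toy sketch of the procedure, assuming a hypothetical three-symbol "model" given as a lookup table of next-token probabilities. It is chosen so that greedy decoding (beam width 1) misses the most probable sequence while beam width 2 finds it.

```python
import math

# Hypothetical next-token distributions, keyed by the last token of the sequence.
TABLE = {
    "S": {"A": 0.6, "B": 0.4},
    "A": {"X": 0.55, "Y": 0.45},
    "B": {"X": 0.9, "Y": 0.1},
}

def next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(start, beam_width, steps):
    """Keep the beam_width best partial sequences by cumulative log-probability."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        # Prune to the highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

greedy_seq = beam_search("S", beam_width=1, steps=2)[0][1]
best_score, best_seq = beam_search("S", beam_width=2, steps=2)[0]
```

Here greedy decoding commits to "A" (probability 0.6) and ends with a sequence of probability 0.33, while the wider beam keeps "B" alive and reaches 0.36.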

Activations

Intermediate tensor values computed during a model's forward pass, held in VRAM transiently and discarded after each decode step.

Model Weights

The trained parameter tensors of an LLM (embeddings, attention projections, MLP layers) loaded into VRAM at startup and kept resident throughout serving.

Hardware

6 terms

Software

8 terms

Optimization

13 terms

Quantization

Reducing model weight (and optionally activation) precision from FP16/BF16 to INT8, FP8, or INT4 to cut VRAM and increase throughput.

INT8

8-bit integer quantization for model weights and/or activations, roughly halving memory vs. FP16 with small accuracy degradation.
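
One simple scheme is symmetric per-tensor ("absmax") quantization, sketched below. This is only one of several INT8 schemes (per-channel and asymmetric variants are also common); the function names and example weights are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floating-point values from the integer codes."""
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.31, 1.27, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each value is stored in one byte plus a shared scale, and the round-trip error is bounded by half a quantization step.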

FP8

8-bit floating-point format (E4M3 or E5M2) natively supported on H100/H200 GPUs, enabling faster matmuls with minimal accuracy loss vs. FP16.

AWQ (Activation-aware Weight Quantization)

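Weight-only quantization method that uses activation statistics to find the small fraction (around 1%) of weight channels most salient for output quality and protects them through per-channel scaling, enabling accurate 4-bit inference.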

GPTQ

Layer-wise weight quantization using second-order Hessian information to minimise quantization error, supporting 4-bit and 8-bit precisions.

SmoothQuant

W8A8 quantization technique that migrates quantization difficulty from activations to weights via a mathematically equivalent per-channel scaling.
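
The equivalence can be checked numerically. The sketch below assumes the per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5, and uses a tiny hand-made activation matrix with one outlier channel; the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def smooth(x, w, alpha=0.5):
    """Scale input channel j of X down by s[j] and row j of W up by s[j]."""
    channels = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(channels)]
    w_max = [max(abs(v) for v in w[j]) for j in range(channels)]
    # Assumes every channel has a non-zero maximum.
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(channels)]
    x_s = [[row[j] / s[j] for j in range(channels)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(channels)]
    return x_s, w_s, s

x = [[0.1, 20.0], [-0.2, -16.0]]   # channel 1 is an activation outlier
w = [[0.5, -0.3], [0.4, 0.2]]
x_s, w_s, s = smooth(x, w)
y, y_s = matmul(x, w), matmul(x_s, w_s)
outlier_before = max(abs(row[1]) for row in x)
outlier_after = max(abs(row[1]) for row in x_s)
```

The layer output is unchanged up to floating-point error, while the outlier activation range shrinks, which makes the activations easier to quantize to 8 bits.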

Speculative Decoding

Latency technique where a small draft model proposes several tokens that the target model verifies in a single parallel pass, often cutting decode latency (time per output token) by roughly 2–3×; it does not shorten prefill or TTFT.

Draft Model

Small, fast auxiliary model used in speculative decoding to propose candidate tokens for the larger target model to verify.
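
The draft-then-verify loop can be sketched with toy deterministic "models". This is a simplified greedy variant, assuming acceptance means the draft token equals the target's argmax; stochastic decoding uses a rejection-sampling rule instead. `target_next` and `draft_next` are hypothetical stand-ins for real model calls.

```python
def target_next(tokens):
    """Toy 'target model': the next token is a fixed function of the context."""
    return (sum(tokens) * 7 + len(tokens)) % 10

def draft_next(tokens):
    """Toy 'draft model': agrees with the target except at every third position."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if len(tokens) % 3 == 0 else guess

def speculative_step(tokens, k=4):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))
    accepted = []
    for i in range(k):
        # In a real system these k checks come from one batched target forward pass.
        expected = target_next(tokens + draft[:i])
        if draft[i] != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            break
        accepted.append(draft[i])
    else:
        # Every draft token was accepted: the same pass also yields one bonus token.
        accepted.append(target_next(tokens + draft))
    return accepted

def generate_speculative(prompt, num_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        tokens += speculative_step(tokens, k)
    return tokens[:len(prompt) + num_tokens]

def generate_greedy(prompt, num_tokens):
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens

spec_out = generate_speculative([1, 2, 3], 12)
greedy_out = generate_greedy([1, 2, 3], 12)
```

The output matches plain greedy decoding with the target model; the speed-up comes only from emitting several tokens per target pass when the draft is accepted.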

EAGLE

Speculative decoding variant with a lightweight head on target-model hidden states, achieving higher draft acceptance rates than token-level draft models.

Medusa

Speculative decoding approach that adds multiple prediction heads to the base model, each predicting the token at a different future offset, so several tokens can be proposed without a separate draft model and verified in one step.

Prefix Caching

Reusing KV cache blocks computed for a shared prompt prefix across multiple requests, eliminating redundant prefill computation.
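
A minimal sketch of block-level prefix matching, assuming a fixed block size and a dictionary keyed by the full token prefix up to each block boundary. Production systems typically use chained hashes and eviction policies; `PrefixCache` and `BLOCK_SIZE` are hypothetical names.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative value)

class PrefixCache:
    def __init__(self):
        # Key: tuple of all tokens up to the end of a block. Value: block id.
        self.blocks = {}

    def match(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) not in self.blocks:
                break
            matched = end
        return matched

    def insert(self, tokens):
        """Register every full block of this sequence as cached."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self.blocks.setdefault(tuple(tokens[:end]), len(self.blocks))

cache = PrefixCache()
system_prompt = [11, 12, 13, 14, 15, 16, 17, 18]
cache.insert(system_prompt + [21, 22, 23])               # first request
hit = cache.match(system_prompt + [31, 32, 33, 34])      # second request, same system prompt
miss = cache.match([99, 98, 97, 96])
```

The second request can skip prefill for the 8 tokens of the shared system prompt and only compute KV entries for its own suffix.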

RadixAttention

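SGLang's generalised prefix caching, which keeps KV cache entries in a radix tree with LRU eviction so that any previously seen prefix (system prompts, few-shot examples, earlier chat turns, branching requests) is matched and reused automatically across requests.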

Chunked Prefill

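Breaking large prefill requests into smaller token chunks so the GPU can interleave prefill and decode work in the same batch, reducing inter-token latency stalls for in-flight decode requests at the cost of a somewhat longer prefill for the chunked request.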
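
The interleaving can be sketched as a per-step token budget: decode requests take one token each, and the remaining budget goes to the next prefill chunk. The scheduler below is a simplified illustration with hypothetical names and numbers, not the policy of any particular serving framework.

```python
def schedule_steps(prompt_len, num_decoding, token_budget):
    """Return a list of (decode_tokens, prefill_tokens) pairs, one per step."""
    if token_budget <= num_decoding:
        raise ValueError("token budget must leave room for at least one prefill token")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decode requests are served first; prefill fills the rest of the budget.
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps

# A 1000-token prompt arrives while 8 requests are decoding, with 256 tokens per step.
steps = schedule_steps(prompt_len=1000, num_decoding=8, token_budget=256)
for decode_tokens, prefill_tokens in steps:
    print(decode_tokens, prefill_tokens)
```

Every step still advances all 8 decoding requests, so none of them waits for the whole 1000-token prefill to finish.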

Metrics

7 terms
