Chapter 2

Models

Architecture and Bottlenecks

35 min read · 5 sections

Inference engineering is the practice of making generative AI models faster, less expensive, and more reliable — without sacrificing quality. Both improving performance and preserving quality require a strong intuition for how models work under the hood.

📖 Transformers

A transformer is a neural network with an attention mechanism that can learn relationships between various parts of a sequence. Introduced in the 2017 paper "Attention Is All You Need," transformers are the foundation of generative AI across every modality.

Across modalities, there are two important styles of transformer-based models:

  • Autoregressive token generation: Start from a tokenized sequence and predict the most likely next token (LLMs)
  • Iterative denoising: Start from random noise and refine toward the most likely output via diffusion (image/video generation)

Neural Networks

The fundamental unit of a neural network is a node (a.k.a. neuron). A node takes an input, multiplies it by a set of weights, adds a bias, and returns the result. A group of nodes forms a layer. The neural networks behind LLMs contain dozens to hundreds of layers across three types:

  • Input layer: The first layer, which accepts and processes input
  • Hidden layers: Every layer between first and last, which iteratively transform input to arrive at output
  • Output layer: The final layer, which returns the prediction

Each layer produces an output that the next layer reads as input. For hidden layers, these outputs are called hidden states.

There are neural networks for creating internal representations, and neural networks for using them:

  • Encoder: Takes an input like text or an image and creates an internal representation with additional semantic meaning
  • Decoder: Uses the internal representation to generate output like text or an image

Modern LLMs are decoder-only, while many models in other modalities use an encoder-decoder architecture. Whisper, for audio transcription, uses an encoder to process audio input and a decoder to generate text tokens.

Linear Layers and Matmul

The most essential operation within a neural network is a matrix multiplication (matmul). A matmul takes an input vector and a matrix and multiplies the vector through the matrix to produce an output vector.

Within a neural network, a linear layer is the simplest form of matmul. Given an input vector, the linear layer applies a weight matrix and adds a bias vector: y = Wx + b
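
As a sketch, the whole layer is a few lines of NumPy; the 4-to-3 dimensions and random weights here are arbitrary toy values, not any real model's shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # weight matrix: 4-dim input -> 3-dim output
b = rng.standard_normal(3)       # bias vector

def linear(x):
    """y = Wx + b"""
    return W @ x + b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = linear(x)
print(y.shape)  # (3,)
```

In a real LLM, the vector dimensions run into the thousands and the weights come from training rather than a random generator.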

Activation Functions

Matrix multiplication is composable — multiplying a vector by two matrices is equivalent to multiplying by the product of those matrices. This is a problem for multi-layer networks because a series of linear layers would collapse into a single layer.

Activation functions are non-linear functions that prevent this collapse while supporting back-propagation. The most basic is ReLU (Rectified Linear Unit): if X > 0, return X; else return 0.

Activation functions like ReLU, SiLU, Swish, and SwiGLU are fast to run, easy to train on, and break linearity to support multi-layer neural networks.
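
A minimal sketch of two of these functions in NumPy (the input values are toy examples):

```python
import numpy as np

def relu(x):
    # Pass positive values through unchanged; zero out the rest.
    return np.maximum(x, 0.0)

def silu(x):
    # SiLU (Swish with beta=1): x * sigmoid(x), a smooth ReLU-like curve.
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # zeros for the negatives, identity for the positives
print(silu(x))  # smooth curve that approaches ReLU for large |x|
```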

LLM Inference Mechanics

LLMs are autoregressive token generation models. An LLM generates new tokens one at a time based on every previous token.

Tokens are the atomic units of language models — numbers that represent chunks of text. Modern LLMs use subword tokenization, meaning each token is a word or a fraction of a word. Converting text to tokens doesn't require neural networks — a tokenizer is a simple mapping between strings and their numerical representation.
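
To make the mapping concrete, here is a toy greedy tokenizer; the four-entry vocabulary and its IDs are invented for illustration, while real tokenizers use learned vocabularies (e.g., via byte-pair encoding) with tens of thousands of entries:

```python
# Toy vocabulary; the subwords and IDs are invented for illustration.
vocab = {"infer": 101, "ence": 102, " is": 103, " fast": 104}
inverse = {token_id: text for text, token_id in vocab.items()}

def encode(text, vocab):
    """Greedy longest-match tokenization, just to show the idea."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(vocab[text[:end]])
                text = text[end:]
                break
        else:
            raise ValueError("text not covered by this toy vocabulary")
    return tokens

tokens = encode("inference is fast", vocab)
print(tokens)                               # [101, 102, 103, 104]
print("".join(inverse[t] for t in tokens))  # inference is fast
```

Note that both directions are plain dictionary lookups: no neural network is involved.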

Inference involves two or three sequences of tokens:

  • Input sequence: The prompt, chat, context, and other input
  • Reasoning sequence: Optionally, for reasoning models, an intermediate thinking output
  • Output sequence: The response generated by the LLM

There are two primary phases of inference:

  • Prefill: Process the input sequence in parallel, calculating attention for each input token and storing the resulting keys and values in a KV cache
  • Decode: Perform forward passes through the model to generate tokens autoregressively

Key Takeaway

Each decode forward pass generates a vector of logits — one per token in the vocabulary. After normalization (softmax), these represent the probability of each potential next token. The output token is then selected by weighted random sampling, shaped by the temperature, top-k, and top-p parameters.
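
The sampling step can be sketched as follows; the logit values and the 4-token vocabulary are illustrative, and production samplers also apply top-p and other filters:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick a token ID from a logit vector via weighted random sampling."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature            # low temperature sharpens the distribution
    if top_k is not None:                    # keep only the k highest-scoring tokens
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)  # weighted random draw

logits = np.array([2.0, 1.0, 0.1, -1.0])     # toy logits for a 4-token vocabulary
token = sample_next_token(logits, temperature=0.7, top_k=2)  # returns 0 or 1
```

With temperature near zero the draw collapses toward the argmax; with top_k=2 only the two highest-probability tokens can ever be chosen.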

LLM Architecture

Every LLM on Hugging Face includes a config.json file detailing the model's architecture. Within a single architecture, there may be:

  • Multiple sizes: Models at different parameter counts (e.g., Llama 8B and 70B)
  • Multiple variants: "base" and "instruct" variants share the same architecture
  • Unlimited fine-tunes: Methods like LoRA change behavior, not architecture

To parse an architecture name like Qwen3MoeForCausalLM:

  • Qwen: The model family
  • 3: The major version
  • MoE: Mixture of Experts model
  • CausalLM: A causal language model (predicts next token based on previous tokens)

Transformer Blocks

The main body of an LLM is a stack of dozens to hundreds of transformer blocks, sandwiched between an input layer and an output layer. The full model has three kinds of layers:

  • Embedding layer: The input layer, which takes tokens and returns embeddings
  • Transformer blocks: Hidden layers that transform hidden states via attention, feed-forward neural network, and normalization sublayers
  • Output layer (LMHead): Converts the final hidden states into logit vectors

The feed-forward neural network sublayers make up the majority of trainable weights, while attention sublayers are the second-largest component.
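A rough sketch of one pre-norm transformer block, with the attention sublayer passed in as a placeholder callable; the dimensions are toy values, and the exact normalization and sublayer order vary by model:

```python
import numpy as np

d = 8  # toy hidden size
rng = np.random.default_rng(0)
W_up = rng.standard_normal((4 * d, d)) * 0.1    # expand hidden state to 4*d
W_down = rng.standard_normal((d, 4 * d)) * 0.1  # project back down to d

def rmsnorm(x):
    # Normalization sublayer (RMSNorm, used by many modern LLMs).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6)

def ffn(x):
    # Feed-forward sublayer: up-projection, ReLU, down-projection.
    return np.maximum(x @ W_up.T, 0.0) @ W_down.T

def transformer_block(x, attention):
    # Pre-norm residual structure: each sublayer adds to the running hidden state.
    x = x + attention(rmsnorm(x))
    x = x + ffn(rmsnorm(x))
    return x

hidden = rng.standard_normal((5, d))                    # 5 token positions
out = transformer_block(hidden, attention=lambda h: h)  # identity stand-in for attention
print(out.shape)  # (5, 8)
```

The up- and down-projection matrices alone hold 8·d² weights here, which is why the feed-forward sublayers dominate the parameter count.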

Attention

Attention is the mechanism transformers use to relate a given token to other tokens in the sequence.

The standard form is scaled dot-product attention with three inputs:

  • Q (queries): The embedded representation of the token being generated
  • K (keys): Representations of all prior tokens
  • V (values): Content representations of all prior tokens, combined according to the attention weights

Attention sublayers are multi-head — each head is one attention operation that could attend to different kinds of relationships (subject-verb agreement, co-reference resolution, etc.).

Two main types of attention:

  • Self-attention: Q, K, and V are all from the same sequence (used in LLMs)
  • Cross-attention: Q is from a different sequence than K and V (used in image generation and multimodal models)

Because attention relates the current token to every previous token, its cost grows quadratically with sequence length. In practice, the KV cache stores the key-value pairs for every previous token, so each new token requires only linear work instead of recomputing the full quadratic attention.
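
A single decode step of scaled dot-product attention against a KV cache might look like this sketch (toy dimensions; real implementations are batched, multi-head, and fused):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step_attention(q, k_cache, v_cache):
    """Attend one new query vector against cached keys and values.

    q: (d,) query for the token being generated
    k_cache, v_cache: (t, d) keys and values for all previous tokens
    """
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)  # one dot product per cached token: linear in t
    weights = softmax(scores)          # attention weights over previous tokens
    return weights @ v_cache           # weighted sum of cached values

rng = np.random.default_rng(0)
t, d = 6, 16                           # 6 cached tokens, head dimension 16
q = rng.standard_normal(d)
k_cache = rng.standard_normal((t, d))
v_cache = rng.standard_normal((t, d))
out = decode_step_attention(q, k_cache, v_cache)
print(out.shape)  # (16,)
```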

⚠️ The KV Cache

The KV cache is built during prefill, used and updated during decode, and lives on GPU memory by default. Storing, accessing, and re-using the KV cache is a major topic in inference engineering (Chapter 5).

Mixture of Experts Models

Mixture of Experts (MoE) is an architecture that adds sparsity to linear layers. Rather than a single giant matrix, an MoE model has hundreds of smaller matrices (experts) and routes each input to a small selection of experts.

For Qwen3-235B-A22B, 22 billion parameters out of 235 billion total are activated per request. MoE models are highly efficient for single-request local inference. However, in batched production inference, different requests activate different experts — you should expect almost all parameters to be active.

Expert routing is granular: the router (a tiny model within the LLM) picks which experts to activate at each layer for every token generated.
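
A top-k routing sketch, assuming a simple softmax-over-selected-experts gating scheme; real routers differ in details such as load balancing, and the sizes here are toy values:

```python
import numpy as np

def route(hidden_state, router_weights, top_k=2):
    """Select top_k experts for one token and normalize their gate weights."""
    logits = router_weights @ hidden_state  # one score per expert
    experts = np.argsort(logits)[-top_k:]   # indices of the top_k experts
    gates = np.exp(logits[experts] - logits[experts].max())
    gates /= gates.sum()                    # gate weights sum to 1
    return experts, gates

rng = np.random.default_rng(0)
num_experts, d = 8, 16                      # toy sizes
router_weights = rng.standard_normal((num_experts, d))
experts, gates = route(rng.standard_normal(d), router_weights)
# Each selected expert's output is then combined, weighted by its gate.
```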

MoE architectures are especially popular for models with 100B+ parameters, while models under roughly 32B parameters tend to stick with traditional dense architectures.

Image Generation Inference Mechanics

Image generation models take a text prompt and create an image. Unlike LLMs, they are pipelines of multiple models:

  • Text encoder: Converts the text prompt into embeddings the denoising model can condition on
  • Denoising model: The heart of the model — iterates from noise to an image based on the prompt
  • Variational Autoencoder (VAE): Converts output from latent space to pixel space

The entire pipeline operates in latent space — a low-dimensional representation of an image (e.g., 128x128 instead of 1024x1024 pixels). The denoising model refines random noise into an image over 30-50 denoising steps.
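
The overall loop can be sketched as below; the update rule is a deliberately crude stand-in for a real noise scheduler such as DDIM, and the latent shape is illustrative:

```python
import numpy as np

def generate_latent(denoise_fn, steps=30, latent_shape=(4, 128, 128), seed=0):
    """Refine random noise toward an image latent over several steps.

    denoise_fn(latent, step) stands in for the denoising model's noise
    prediction; real schedulers use more careful update math.
    """
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(latent_shape)     # start from pure noise
    for step in range(steps):
        predicted_noise = denoise_fn(latent, step)
        latent = latent - predicted_noise / steps  # crude step toward the image
    return latent                                  # the VAE decodes this to pixels

# Toy run with a fake denoiser and a small latent.
latent = generate_latent(lambda lat, step: 0.1 * lat, steps=5, latent_shape=(4, 8, 8))
print(latent.shape)  # (4, 8, 8)
```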

Key inference arguments:

  • Prompt: Describes what the image should look like
  • Negative prompt: Styles or objects that should not appear
  • Number of steps: Speed vs. quality tradeoff (30-50 typical)
  • Guidance scale: Balance between creativity and prompt adherence
  • Image size: Fixed menu of resolutions and aspect ratios

Image Generation Model Architecture

Image generation models are built on diffusion transformers — similar to LLM transformers but processing image data in patches (2x2 or 4x4 pixels embedded into latent space).

The latest research blends diffusion transformer architecture with LLM architecture. Anything that can be tokenized can be modeled as an LLM, and LLMs solve problems inherent to diffusion models: they support variable-length output and produce each token in a single forward pass rather than through many denoising steps.

Few-Step Image Generation Models

Rather than making each step faster, few-step models create high-resolution images with eight or fewer denoising steps — 80 to 90 percent faster, with noticeably lower quality.

Two primary methods: latent consistency (predict target latent directly) and distillation (adversarial/progressive distillation to emulate a larger model).

Video Generation

Video generation models are architecturally similar to image generation models, just bigger — 3-5x more parameters and 10-100x more information in latent space. Modern video models hold the entire video in latent space (X, Y, and T dimensions) and modify it on each denoising step.

Video generation is so compute-intensive that models typically run with a batch size of one — a full node of eight GPUs works on a single request.

Calculating Inference Bottlenecks

GPUs have two main resources: compute (FLOPS) and memory bandwidth (bytes/s). In a perfectly optimized system, both are fully utilized. In reality, systems have bottlenecks:

  • LLM prefill (KV cache construction) is compute bound
  • LLM decode (token generation) is memory bound
  • Image and video generation are compute bound

Ops:Byte Ratio and Arithmetic Intensity

Each GPU has a specific ops:byte ratio. For an H100 in FP16: 989 teraFLOPS / 3.35 TB/s = ~295 ops per byte.

Arithmetic intensity is the ratio between work and memory traffic for a calculation:

Arithmetic Intensity = FLOPs / Bytes Accessed

Plotting against a roofline model reveals whether an algorithm is:

  • Compute bound: Arithmetic intensity > hardware ops:byte ratio (hits performance ceiling)
  • Memory bound: Arithmetic intensity < hardware ops:byte ratio (hits bandwidth ceiling)
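
Using the H100 numbers from above, a small calculator shows which side of the roofline a given matmul lands on; the GEMM shapes below are illustrative, not taken from any specific model:

```python
# H100 FP16 numbers from the text; GEMM shapes below are illustrative.
H100_FLOPS = 989e12       # peak FP16 throughput, FLOPs/s
H100_BANDWIDTH = 3.35e12  # HBM bandwidth, bytes/s
OPS_PER_BYTE = H100_FLOPS / H100_BANDWIDTH  # ~295

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """Arithmetic intensity of an (m, k) x (k, n) matmul in FP16."""
    flops = 2 * m * k * n  # one multiply + one add per output contribution
    bytes_accessed = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_accessed

# Decode: one token's vector against a 4096x4096 weight matrix.
decode = matmul_intensity(1, 4096, 4096)      # ~1 op/byte -> memory bound
# Prefill: 2048 input tokens against the same weight matrix.
prefill = matmul_intensity(2048, 4096, 4096)  # ~1024 ops/byte -> compute bound
print(decode < OPS_PER_BYTE < prefill)  # True
```

The same weight bytes support 2048x more useful work during prefill, which is exactly why prefill sits on the compute roof and decode on the bandwidth roof.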

Key Takeaway

The key difference: prefill processes the entire input sequence in parallel (load weights once, many matrix-matrix multiplications = high arithmetic intensity), while decode generates tokens one at a time (load weights for each token, vector-matrix multiplication = low arithmetic intensity).


LLM Inference Bottlenecks

  • Prefill: Model weights loaded once, then large matrix multiplications against the input matrix. High arithmetic intensity → compute bound.
  • Decode: Model weights loaded for every token, with less-expensive vector-matrix multiplication. Low arithmetic intensity → memory bound.

When optimizing performance, the goal is to make the bottleneck less limiting. For example, batching multiple requests makes decode less memory bound because processing a batch uses more compute for the same memory traffic.

Image Generation Inference Bottlenecks

Image and video generation are compute bound due to the massive matrix multiplications involved in processing latent space across multiple denoising steps. Optimizing compute throughput (more FLOPS) directly improves generation speed.

Optimizing Attention

Attention is the most expensive operation in inference across both LLMs and image/video generation. As context lengths grow and image resolutions increase, optimizing attention becomes increasingly critical. The full treatment of attention optimization techniques is in Chapter 5.
