LLM Inference Acceleration: Complete Guide

11 min read · Updated 2026-06-01

💡 This is the quick acceleration techniques guide

This article covers every major technique with practical guidance on when each helps. For the complete deep-dive — including implementation details, quantization math, and speculative decoding internals — read Chapter 5: Quantization & Speculative Decoding and Chapter 4: vLLM, SGLang & TensorRT-LLM in Context.

TL;DR

LLM inference acceleration is a toolkit of complementary techniques targeting different bottlenecks: quantization reduces memory traffic (the dominant cost for decode), continuous batching increases GPU utilization, prefix caching eliminates repeated prefill computation, speculative decoding reduces latency at low batch sizes, and parallelism strategies scale beyond a single GPU. The order of application matters — quantization and continuous batching give the largest wins for the least effort and should be applied first. Speculative decoding, prefix caching, and disaggregation are high-value but workload-specific. Always profile before optimizing; the bottleneck changes with batch size and workload shape.

Key Facts

Quantization speedup (FP8 on H100): 1.5–2× decode throughput vs FP16; near-zero quality loss
Continuous batching win: 3–10× throughput vs static batching under mixed-length concurrent load
Speculative decoding speedup: 1.5–3× effective TPS at batch size ≤8; shrinks at higher batch
Prefix caching TTFT win: 90%+ TTFT reduction for repeated system prompts (near-zero recompute)
INT4 (AWQ) quality tradeoff: ~1–3% benchmark degradation; 4× memory reduction, ~2× decode speedup
Disaggregated prefill/decode: Reduces P99 TTFT by 40–60% under heavy load by isolating prefill spikes

LLM inference is expensive because it is fundamentally sequential (decode generates one token at a time) and memory-intensive (the full model is re-read from VRAM for every token). A well-understood toolkit of techniques attacks these constraints from different angles. But applying them in the wrong order wastes effort, and applying them without profiling your specific bottleneck can produce no improvement or even regressions.

This guide covers every major technique with enough depth to know why it works, when to reach for it, and how much to expect.

Know your bottleneck first

Before applying any technique, you need to know what is actually limiting you. LLM inference has two phases with completely different performance profiles:

Prefill (prompt processing) — compute-bound. All prompt tokens are processed in parallel as a matrix multiplication. Tensor Cores are the bottleneck. Arithmetic intensity is high (~1,000 FLOP/byte for long prompts).
Decode (token generation) — memory-bandwidth-bound. One token is generated per step, requiring a full read of model weights from VRAM. Arithmetic intensity is ~1 FLOP/byte at batch size 1.

Most interactive workloads (chat, agents, code completion) are decode-dominated: TTFT matters, but most of the wall-clock time of an interaction goes to generating 200–2,000 output tokens. Memory bandwidth is the binding constraint.

Batch processing workloads (document classification, bulk summarization) can become prefill-dominated or compute-limited at large batch sizes where the decode batch is large enough to shift arithmetic intensity above the machine balance point.
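The compute-bound versus bandwidth-bound distinction can be checked with a back-of-the-envelope roofline calculation. The sketch below is illustrative only: the peak FLOPS and HBM bandwidth constants are assumed, approximate H100 SXM figures, and the intensity model ignores KV cache and activation traffic.

```python
# Rough roofline check for decode. Hardware numbers are assumptions
# (approximate H100 SXM dense FP16 peak and HBM3 bandwidth).
PEAK_FLOPS = 989e12        # FLOP/s, FP16 dense (assumed)
HBM_BANDWIDTH = 3.35e12    # bytes/s (assumed)


def machine_balance(peak_flops=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """FLOP per byte the GPU can sustain before compute becomes the limit."""
    return peak_flops / bandwidth


def decode_intensity(batch_size, bytes_per_weight=2.0):
    """Approximate FLOP per weight byte read during one decode step.

    Each weight contributes one multiply-add (2 FLOP) per sequence in the
    batch and is read from VRAM once per step. KV cache traffic is ignored.
    """
    return 2.0 * batch_size / bytes_per_weight


def bound(batch_size, bytes_per_weight=2.0):
    """Classify a decode step as memory-bound or compute-bound."""
    if decode_intensity(batch_size, bytes_per_weight) < machine_balance():
        return "memory-bound"
    return "compute-bound"
```

Under these assumptions the balance point sits near 295 FLOP/byte, so FP16 decode stays memory-bound until the batch reaches a few hundred sequences.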

Practical diagnosis:

Is your GPU utilization <50% with requests queued? You're scheduling-bound or memory-capacity-bound (not enough VRAM for KV cache to batch aggressively).
Is TTFT your main complaint? Your prefill throughput is the bottleneck — look at chunked prefill and prefill compute optimization.
Is output tokens/second your complaint? You're bandwidth-bound on decode — quantize first.
Are you latency-constrained with low batch sizes? Speculative decoding can help.

1. Quantization: the highest-leverage starting point

Quantization stores weights (and optionally activations and the KV cache) in lower precision than the training format. Its impact on inference is disproportionate because decode is memory-bandwidth-bound: halving the bytes per weight halves the memory traffic per token, which approximately halves decode latency at low batch sizes.
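A minimal sketch of why bytes per weight matter, assuming decode time is bounded below by the time to read every weight once per token. The bandwidth constant is an assumed, approximate H100 SXM figure, and the calculation ignores whether the weights actually fit on one device.

```python
HBM_BANDWIDTH = 3.35e12  # bytes/s, assumed (roughly H100 SXM)


def min_decode_ms_per_token(params_billion, bytes_per_weight,
                            bandwidth=HBM_BANDWIDTH):
    """Lower bound on per-token decode latency at batch size 1.

    Assumes every weight is read from VRAM exactly once per generated token
    and that nothing else (KV cache, activations, kernel launch) costs time.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_weight
    return weight_bytes / bandwidth * 1e3


fp16_ms = min_decode_ms_per_token(70, 2.0)
fp8_ms = min_decode_ms_per_token(70, 1.0)
int4_ms = min_decode_ms_per_token(70, 0.5)
```

For a 70B model this bound works out to roughly 42 ms, 21 ms, and 10 ms per token for FP16, FP8, and INT4 respectively; real systems land above the bound.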

FP8 quantization (recommended default)

FP8 is a 1-byte floating-point format with either a 4-bit exponent and 3-bit mantissa (E4M3, used for weights and activations) or a 5-bit exponent and 2-bit mantissa (E5M2, which trades precision for range and is mainly used for gradients during training). H100 introduced native FP8 Tensor Cores that run at 2× FP16 throughput.

Why FP8 is the right default:

Quality loss < 1% on standard benchmarks (MMLU, HumanEval, HellaSwag) when using per-tensor dynamic scaling
2× throughput vs FP16 on H100/H200 Tensor Cores (the FP8 matmul path runs at 1,979 TFLOPS vs 989 TFLOPS FP16)
2× VRAM reduction vs FP16: a 70B model drops from 140 GB to 70 GB of weights, which fits on a single 80 GB H100 but leaves only about 10 GB for KV cache and activations
Supported natively in vLLM, SGLang, TensorRT-LLM, and Hugging Face with simple flags

INT8 quantization (SmoothQuant, GPTQ-INT8)

INT8 stores weights as 8-bit integers with a per-channel or per-group scale factor. The primary algorithm in production is SmoothQuant (Xiao et al., 2022), which addresses the challenge that transformer activations have outlier channels with very large magnitudes that cause large quantization error if naively rounded to INT8.

SmoothQuant migrates quantization difficulty from activations to weights by scaling each channel: smooth activations are easy to quantize, and the scaling is absorbed into weights. Result: W8A8 (INT8 weights and activations) with FP8-comparable quality.
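The smoothing step can be sketched in a few lines. This toy uses plain Python lists, a hand-picked 2×2 example with one outlier activation channel, and the per-channel scale s_j = max|X_j|^α / max|W_j|^(1−α) with α = 0.5; the matrices and helper names are illustrative assumptions, not part of any library.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activations by s and multiply the matching weight rows by s."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


X = [[0.5, 40.0], [-0.3, -60.0]]   # channel 1 is an activation outlier
W = [[0.2, -0.1], [0.05, 0.3]]     # shape: in_channels x out_channels
scales = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)
```

The product is unchanged (X_s · W_s equals X · W up to rounding), while the outlier channel's activation range shrinks, which is what makes the activations easier to round to INT8.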

Quality vs FP8: comparable, at under 1% degradation. Throughput advantage: 1.5–1.8× vs FP16 (less than FP8's 2× because INT8 matmuls on H100 don't use the FP8 Tensor Core path).

INT4 quantization (AWQ, GPTQ)

INT4 stores each weight in 4 bits with a group quantization scale. The two main algorithms are:

GPTQ (Frantar et al., 2022): Layer-wise second-order optimization that minimizes quantization error per layer using the inverse Hessian. Fast to run (30–60 min for 70B), widely available via AutoGPTQ.

AWQ (Lin et al., 2023): Activation-aware weight quantization. Identifies the roughly 1% of weight channels that are "salient" (those corresponding to large activation magnitudes) and protects them by scaling them before quantization, which sharply reduces their quantization error without resorting to mixed precision. AWQ-INT4 consistently outperforms GPTQ-INT4 on reasoning and coding benchmarks.

Quality tradeoffs at INT4:

General benchmarks (MMLU, HellaSwag): 1–3% degradation
Reasoning/coding: 3–7% degradation for aggressive group sizes (g=64 or g=32)
Math/logic chains: up to 10% degradation — INT4 is risky for o1-style reasoning models

Speedup: ~2× decode throughput vs FP16. The 4× VRAM reduction cuts weight traffic by 4×, but dequantization overhead and traffic that is not quantized (KV cache, activations) limit the realized gain to about 2×.
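To make "group quantization scale" concrete, here is a minimal sketch of plain round-to-nearest asymmetric INT4 quantization with one scale and offset per group. It is not GPTQ or AWQ (both add error-aware steps on top), and the toy weight vector with a single outlier is an assumption chosen to make the effect of grouping visible.

```python
def quantize_group_int4(weights):
    """Asymmetric 4-bit quantization of one group: 16 levels between min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]   # integers in 0..15
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]


def quantize_int4(weights, group_size):
    """Quantize then dequantize, one (scale, offset) pair per group."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        out.extend(dequantize_group(*quantize_group_int4(group)))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.01 * (i - 32) for i in range(64)]
weights[0] = 2.0   # a single outlier weight
err_g64 = mean_abs_error(weights, quantize_int4(weights, 64))
err_g16 = mean_abs_error(weights, quantize_int4(weights, 16))
```

In this toy the smaller group confines the outlier's wide range to one group, at the cost of storing more scales. How reconstruction error maps to benchmark quality depends on the model, calibration data, and algorithm.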

KV cache quantization

The KV cache can be quantized independently of the model weights. FP8 KV cache reduces KV cache VRAM by 2× with minimal quality impact, directly enabling 2× longer contexts or 2× larger batch sizes for the same VRAM budget.
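The 2× claim follows directly from the KV cache size formula. The sketch below assumes a 70B-class model shape with grouped-query attention (80 layers, 8 KV heads, head dimension 128); those numbers are illustrative, so substitute your model's configuration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache stored for one token of one sequence."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(vram_budget_gb, **shape):
    """How many tokens of KV cache fit in a given VRAM budget."""
    return int(vram_budget_gb * 1e9 // kv_bytes_per_token(**shape))


shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)  # assumed model shape
fp16_bytes = kv_bytes_per_token(bytes_per_element=2, **shape)
fp8_bytes = kv_bytes_per_token(bytes_per_element=1, **shape)
fp16_tokens = max_cached_tokens(20, bytes_per_element=2, **shape)
fp8_tokens = max_cached_tokens(20, bytes_per_element=1, **shape)
```

With this shape, FP16 costs 320 KiB per token, so a 20 GB KV budget holds about 61K tokens at FP16 and about 122K at FP8, shared across all concurrent sequences.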

Try it: Quantization Quality Estimator →

Compare INT4/INT8/FP8 memory savings, throughput gain, and quality risk for your specific model and task type.

Quantization technique summary

Format	VRAM vs FP16	Approx decode speedup	Typical quality loss	Best for
FP8 (native H100)	0.5×	1.8–2.0×	<1%	Default production choice
INT8 W8A8 (SmoothQuant)	0.5×	1.5–1.8×	<1%	GPUs without FP8 Tensor Cores
INT4 AWQ (g=128)	0.25×	1.8–2.0×	1–2%	Memory-constrained deployments
INT4 GPTQ (g=64)	0.25×	1.8–2.0×	2–4%	Budget, less critical quality
INT4 AWQ (g=32)	0.25×	1.5–1.8×	3–7%	Extreme compression, verify quality
FP8 KV cache	KV at 0.5×	Context 2× longer	<1%	Long context, high concurrency

2. Continuous (in-flight) batching: free throughput

Static batching waits for every sequence in a batch to finish (by hitting its maximum length or an EOS token) before returning results and starting the next batch. If 7 of 8 requests generate 50 tokens and 1 generates 500, the slots of the 7 short requests sit idle for 450 steps while the long sequence finishes.

Continuous batching (Orca, Yu et al., 2022) inserts new requests into the running batch as existing requests complete. The scheduler maintains a pool of active sequences and fills empty slots immediately. The key insight: all active sequences share the same forward pass, so adding a new sequence only costs the marginal KV cache pages for that sequence, not a full batch restart.

Under realistic serving distributions (where output length varies by 10–100×), continuous batching increases GPU utilization from 30–60% (static) to 85–95%. The throughput gain is typically 3–10× depending on how variable your output lengths are.

This is table-stakes in every modern serving engine. If you're somehow still running static batching, enabling continuous batching is your first and most impactful action. It's free — no quality tradeoff.
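A toy simulation makes the difference visible. It counts decode steps only, treats every step as equally expensive, and ignores prefill and KV memory limits, so the numbers are illustrative rather than a prediction for any real engine.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished sequence's slot is refilled from the queue on the next step."""
    queue = list(output_lengths)
    active = []   # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps


def utilization(output_lengths, batch_size, steps):
    """Fraction of batch slots that produced a token."""
    return sum(output_lengths) / (batch_size * steps)


lengths = ([500] + [50] * 7) * 4   # 32 requests with a 10x length spread
static_steps = static_batching_steps(lengths, 8)
continuous_steps = continuous_batching_steps(lengths, 8)
```

For this request mix, static batching needs 2,000 steps and continuous batching needs 650, which corresponds to slot utilization of about 21% versus 65%.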

3. PagedAttention and KV cache management

The KV cache is the second-largest VRAM consumer after model weights. Poor KV cache management causes two problems: VRAM fragmentation (allocated but unusable memory) and premature sequence eviction (requests terminated when VRAM runs out).

PagedAttention (vLLM)

PagedAttention (Kwon et al., 2023) applies virtual memory concepts to KV cache management. Instead of pre-allocating a contiguous buffer per request (which causes fragmentation as requests grow at different rates), KV cache is stored in fixed-size pages (typically 16 tokens per page) that are allocated on demand and tracked via a page table.

Fragmentation with static allocation: ~40% of VRAM wasted. Fragmentation with PagedAttention: <4%. Result: ~2× more concurrent sequences per GPU, which translates to roughly 2× higher throughput from the same VRAM budget.
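The page-table idea can be sketched as a small allocator. This is a bookkeeping toy, not vLLM's implementation: the class and method names are hypothetical, and it tracks only page ids, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to a list of physical page ids."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of physical page ids
        self.seq_len = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; allocate a page only when needed."""
        length = self.seq_len.get(seq_id, 0)
        if length % self.page_size == 0:   # first token, or current page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        self.seq_len[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated-but-unused token slots (internal fragmentation)."""
        return sum(len(pages) * self.page_size - self.seq_len[seq]
                   for seq, pages in self.page_table.items())
```

Waste is bounded by at most one partially filled page per sequence, which is why fragmentation stays small regardless of how unevenly sequences grow.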

Prefix caching

When multiple requests share a common prefix (system prompt, few-shot examples, multi-turn conversation history), computing the KV cache for that prefix is redundant work. Prefix caching stores the computed KV pages for a prefix and reuses them for any future request that starts with the same prefix.

When it wins:

System prompts: if you have a 500-token system prompt sent with every request, prefix caching eliminates 100% of the compute for that prefix on cache-hit requests. TTFT drops from 500-token prefill + N-token prefill to N-token prefill only.
Few-shot prompting: shared few-shot examples across requests share the prefix KV cache
Multi-turn conversations: each new user turn reuses all previous turns' KV cache

A typical hit rate of 70–90% on system prompt prefixes translates to 50–70% TTFT reduction for short user queries against long system prompts.
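One common way to implement the lookup is to hash each KV page together with everything before it, so a page only matches when the whole prefix matches. The sketch below assumes 16-token pages and uses hypothetical names; real engines differ in hashing details, eviction, and how partial pages are handled.

```python
import hashlib

PAGE_SIZE = 16  # tokens per KV page (assumed)


def page_hashes(token_ids, page_size=PAGE_SIZE):
    """Hash each full page chained with the hash of all preceding pages."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids) - page_size + 1, page_size):
        page = token_ids[start:start + page_size]
        prev = hashlib.sha256(prev + repr(page).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.cached = set()

    def insert(self, token_ids):
        """Record the pages computed for a finished prefill."""
        self.cached.update(page_hashes(token_ids))

    def cached_prefix_len(self, token_ids):
        """Number of leading tokens whose KV pages can be reused."""
        hits = 0
        for h in page_hashes(token_ids):
            if h not in self.cached:
                break
            hits += 1
        return hits * PAGE_SIZE


system_prompt = list(range(500))   # stand-in for a 500-token system prompt
cache = PrefixCache()
cache.insert(system_prompt + [1001, 1002, 1003])
reused = cache.cached_prefix_len(system_prompt + [2001, 2002])
```

Here the second request reuses 496 of the 500 system-prompt tokens; the last 4 fall in a partial page and are recomputed along with the user tokens.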

Try it: KV Cache Sizing Calculator →

Understand KV cache VRAM consumption at different concurrency levels, sequence lengths, and quantization settings.

RadixAttention (SGLang)

SGLang extends prefix caching to a full radix tree (trie) over all cached KV sequences. While standard prefix caching only matches at the start of a sequence, RadixAttention finds the longest common prefix path through the trie across all previously cached sequences — not just system prompts.

For workloads like multi-candidate generation (fork-and-verify) or structured output with many shared branches, RadixAttention cache hit rates approach 80–95%, far exceeding flat prefix caching.
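The longest-prefix lookup can be illustrated with a plain token trie. This is a simplification under stated assumptions: one token per node, no edge compression, no eviction, and no KV storage, all of which a production radix tree would add.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode


class TokenTrie:
    """Stores every cached token sequence so branches can share prefixes."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())

    def longest_prefix(self, token_ids):
        """Length of the longest cached path that matches the request's start."""
        node, matched = self.root, 0
        for tok in token_ids:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched


trie = TokenTrie()
trie.insert([1, 2, 3, 4, 5])   # first candidate
trie.insert([1, 2, 3, 9, 9])   # second candidate forks after token 3
```

A new request starting `[1, 2, 3, 9, 7]` matches four cached tokens by following the second branch, even though no single earlier request was a system-prompt-style shared prefix.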

4. Speculative decoding: latency at low batch sizes

Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) uses a small, fast draft model to propose K tokens speculatively, then the large target model verifies all K proposals in a single forward pass (which is parallel, not sequential).

Why it works: The target model's verification pass can check K tokens in the same time as generating 1 token, because the K draft tokens are processed as a batch in parallel during the verification step. If the acceptance rate is high, you get K accepted tokens for the cost of roughly 1 target model decode step.
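The draft-then-verify loop can be sketched with greedy acceptance, where a draft token is kept only if it equals the target's own greedy choice. This is a simplified variant (the published methods use rejection sampling to preserve the target's sampling distribution), and the two toy "models" are hypothetical functions standing in for real networks.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One draft-then-verify round with greedy acceptance.

    draft_next / target_next map a token list to the next token.
    Returns the tokens appended this round (between 1 and k + 1 of them).
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks every position. A real engine does this in
    #    one batched forward pass; the loop here only emulates that result.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch, then stop
            return accepted
        accepted.append(tok)

    # 3. All k drafts accepted: the same pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target_next(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10
```

Because every emitted token is either confirmed or supplied by the target, the output sequence matches what the target alone would have produced greedily; only the number of target passes changes.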

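Speedup math: If the draft model proposes K=4 tokens with per-token acceptance rate α, the expected number of tokens produced per verification step (accepted draft tokens plus the one token the target model always contributes) is:

E[tokens] = (1 - α^(K+1)) / (1 - α)

At α=0.8, K=4: E[tokens] = (1 - 0.8^5) / (1 - 0.8) ≈ 3.36 tokens per step.

Effective speedup ≈ E[tokens] / cost_ratio, where cost_ratio is the cost of one draft-plus-verify round relative to one plain target decode step. With a 7B draft and 70B target, the draft overhead is ~10% of target cost. If that 10% is the total draft overhead per round, cost_ratio ≈ 1.1 and the speedup is ~3×; if it is the cost of each of the K=4 draft steps, cost_ratio ≈ 1 + 4 × 0.1 = 1.4 and the speedup is ~2.4×. Net speedup: ~2.4–3× at low batch sizes.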
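The arithmetic above is easy to explore numerically. In the sketch below, the cost model (K sequential draft steps plus one target pass, each draft step costing a fixed fraction of a target step) is an assumption, and the function names are illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """(1 - alpha^(k+1)) / (1 - alpha), for per-token acceptance rate alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, draft_cost):
    """Tokens per round divided by the round's cost in target-step units.

    draft_cost: one draft step as a fraction of one target step (assumed).
    """
    return expected_tokens_per_step(alpha, k) / (1 + k * draft_cost)


tokens = expected_tokens_per_step(0.8, 4)
speedup = estimated_speedup(0.8, 4, draft_cost=0.1)
```

With α = 0.8, K = 4, and a per-step draft cost of 0.1, this gives about 3.36 tokens per round and a speedup of about 2.4×. Sweeping K shows the usual tradeoff: longer drafts raise expected tokens but also raise the cost of each round.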

When to use it:

Low-to-medium batch sizes (≤8 concurrent requests): strongest benefit
The draft model acceptance rate must be high (>0.6) — requires a well-matched draft/target pair
Latency is the primary SLO (not throughput): at batch=1, speculative decoding is the best latency optimization available

When NOT to use it:

High batch sizes (>32): the target model is already efficiently utilizing GPU compute from the large batch. Adding draft overhead reduces net throughput.
When you can't find a good draft model: a poorly matched draft (low acceptance rate) actually slows you down by adding draft compute without useful accepted tokens.

Advanced speculative methods

EAGLE / EAGLE-2 (Li et al., 2024): Instead of a separate draft model, uses a lightweight draft head attached to the target model's feature representations. The draft head has access to the target model's internal features, leading to higher acceptance rates (0.85+) than a separate small draft model. EAGLE-2 adds context-aware dynamic draft trees for further gains.

Medusa (Cai et al., 2024): Multiple draft heads on the target model, each predicting different future token positions. Parallelizes draft generation. Lower acceptance rate than EAGLE but simpler to integrate.

Lookahead decoding (Fu et al., 2024): Builds draft tokens from n-gram patterns in the context, with no separate model. Zero draft-model compute cost, but lower acceptance rates.

Try it: Speculative Decoding Simulator →

Model draft acceptance rates, draft lengths, and batch sizes to estimate real-world TPS improvement for your setup.

5. Chunked prefill

Long input prompts cause prefill to monopolize the GPU for many milliseconds, blocking decode for other requests in the batch. A 4,096-token prompt takes on the order of 50ms to prefill on an H100 (longer for large models), during which no decode steps happen, causing decode latency spikes for concurrent users.

Chunked prefill breaks long prefill operations into chunks (e.g., 512 tokens at a time) and interleaves them with decode steps. Decode steps run between prefill chunks, preventing prefill from starving concurrent decode.

Effect: P99 TTFT is more predictable (bounded by chunk size), and P99 decode token latency doesn't spike when long prompts arrive. Chunked prefill doesn't improve average throughput significantly — it improves latency variance and fairness.
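One way to picture the interleaving is a per-step token budget shared by decode and prefill. The sketch below is a simplified scheduler for a single incoming prompt; the budget of 512 tokens and the function name are assumptions, and real engines juggle many prompts and changing decode counts.

```python
def schedule_steps(prompt_len, num_decoding, token_budget=512):
    """Yield (prefill_tokens, decode_tokens) for each engine step.

    Every step first reserves one token for each running decode sequence,
    then fills the remaining budget with the next slice of the prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget leaves no room for prefill")
    done = 0
    while done < prompt_len:
        chunk = min(prompt_len - done, token_budget - num_decoding)
        yield chunk, num_decoding
        done += chunk


steps = list(schedule_steps(prompt_len=4096, num_decoding=32))
```

With 32 running decode sequences, the 4,096-token prompt is spread over 9 steps of at most 480 prefill tokens each, and every one of those steps still advances all 32 decodes.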

6. Model parallelism strategies

When a model is too large for a single GPU, you must distribute it across multiple GPUs. Three strategies exist, with different tradeoffs:

Tensor parallelism (TP)

Splits individual layer weight matrices across GPUs. For attention, the Q/K/V projection weights are split column-wise across GPUs; each GPU computes a partial attention output; an all-reduce synchronization reconstructs the full output.

Latency: Lowest of the three strategies — all GPUs work in parallel on each layer
Communication cost: All-reduce in every layer, moving on the order of 2 × sequence_len × hidden_dim activation elements per layer per step (one all-reduce after attention and one after the MLP)
Requires fast interconnect: NVLink or InfiniBand; PCIe all-reduce is prohibitively slow at TP≥4
Best for: Fitting large models within a node; latency-critical serving
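The split-compute-reduce pattern can be shown on a two-matmul block, used here as a stand-in for an attention or MLP layer with the nonlinearity omitted. The first weight matrix is split by columns, the second by the matching rows, and a sum across ranks plays the role of the all-reduce. All names are illustrative and the "ranks" are just list entries.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def split_columns(M, parts):
    width = len(M[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in M] for i in range(parts)]


def split_rows(M, parts):
    height = len(M) // parts
    return [M[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for the collective: element-wise sum of each rank's partial."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tensor_parallel_block(X, A, B, tp):
    A_shards = split_columns(A, tp)   # each rank holds hidden/tp columns of A
    B_shards = split_rows(B, tp)      # and the matching rows of B
    partials = [matmul(matmul(X, A_i), B_i)
                for A_i, B_i in zip(A_shards, B_shards)]
    return all_reduce_sum(partials)


X = [[1, 2], [3, 4]]
A = [[1, 0, 2, 1], [0, 1, 1, 3]]
B = [[1, 2], [0, 1], [2, 0], [1, 1]]
```

Each rank touches only its shard of the weights, and the only communication is the final sum over a tensor the size of the layer output, which is why interconnect bandwidth dominates TP scaling.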

Pipeline parallelism (PP)

Splits the model into stage blocks (e.g., 4 stages of 20 layers each) and assigns each stage to one or more GPUs. One GPU processes layers 1–20, passes the activations to GPU 2 for layers 21–40, etc.

Communication cost: Lower than TP — only activations pass between stage boundaries
"Bubbles": stages sit idle while the pipeline fills and drains, because a downstream stage cannot start a micro-batch until the upstream stage has finished it (pipeline bubbles reduce utilization by ~5–15%)
Best for: Multi-node deployments where NVLink across nodes isn't available; better suited for throughput-optimized batch processing than latency-sensitive serving
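For a simple fill-and-drain schedule, the idle fraction has a closed form: with p stages and m micro-batches, (p − 1) of the (m + p − 1) time slots on each stage are empty. The sketch below assumes equal-cost stages and that schedule; interleaved schedules and uneven stages behave differently.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple fill-and-drain pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


few_microbatches = bubble_fraction(num_stages=4, num_microbatches=4)
many_microbatches = bubble_fraction(num_stages=4, num_microbatches=32)
```

With 4 stages, 32 micro-batches give a bubble of about 9%, while only 4 micro-batches give about 43%. Keeping bubbles in the single-digit range requires many more micro-batches than stages, which is easier in batch processing than in latency-sensitive serving.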

Expert parallelism (EP)

Specific to mixture-of-experts models. Expert modules are distributed across GPUs, and each token is routed to its assigned expert GPU(s) via all-to-all communication.

Very high MoE throughput when expert load is balanced
Sensitive to routing imbalance — if many tokens route to the same expert, that GPU becomes a bottleneck
Best for: Large MoE models (DeepSeek-R1/V3, Mixtral) at scale; requires expert-aware infrastructure (SGLang has best EP support)
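Routing imbalance can be quantified as the busiest GPU's token count relative to the mean, since each step waits for the slowest rank. The sketch assumes top-1 routing and experts assigned to GPUs in contiguous blocks; both are simplifying assumptions, and the names are illustrative.

```python
from collections import Counter


def gpu_loads(expert_ids, num_experts, num_gpus):
    """Tokens landing on each GPU when experts are split evenly across GPUs."""
    experts_per_gpu = num_experts // num_gpus
    counts = Counter(e // experts_per_gpu for e in expert_ids)
    return [counts.get(g, 0) for g in range(num_gpus)]


def imbalance_factor(loads):
    """Busiest GPU's load divided by the mean; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


balanced = [i % 8 for i in range(64)]            # tokens spread over 8 experts
skewed = [0] * 40 + [i % 8 for i in range(24)]   # expert 0 is "hot"
balanced_loads = gpu_loads(balanced, num_experts=8, num_gpus=4)
skewed_loads = gpu_loads(skewed, num_experts=8, num_gpus=4)
```

In the skewed case one GPU receives 46 of 64 tokens, an imbalance factor of about 2.9, so the expert layer runs nearly three times slower than the balanced case despite identical total work.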

Practical guidance on combining strategies:

Within a node (NVLink): TP is primary; set TP to the number of GPUs in the node so all-reduce traffic stays on NVLink
Across nodes: PP to minimize cross-node communication (all-reduce is expensive over InfiniBand)
Large MoE at datacenter scale: EP=N_experts or EP=N_GPUs with TP within each EP group

7. Disaggregated prefill/decode

Traditional serving runs prefill and decode on the same hardware, which causes interference: a long prefill temporarily blocks decode for all other requests. More subtly, prefill-optimal hardware (high FLOPS, long prompt processing) and decode-optimal hardware (high bandwidth, low latency per token) are different.

Disaggregation runs prefill and decode on separate, independently scaled hardware pools. A prefill request hits a prefill server, gets processed in full, then the resulting KV cache is transferred to a decode server which handles all token generation.

Benefits:

Prefill spikes don't block decode — the decode pool runs continuously at high bandwidth utilization
Independent scaling — scale prefill capacity during high-throughput batch periods; scale decode capacity for concurrent users
Hardware matching — compute-optimized GPUs for prefill; bandwidth-optimized (H200) for decode
P99 TTFT improvement — under heavy load, disaggregated TTFT is 40–60% lower because prefill isn't queued behind other prefills sharing a GPU with ongoing decode

The cost: KV cache transfer between prefill and decode nodes adds latency (~5–20ms depending on network bandwidth and context length) and infrastructure complexity. NVIDIA Dynamo provides managed disaggregation as part of its inference platform.
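The transfer cost can be estimated from context length, KV bytes per token, and link bandwidth. The figures in the example (160 KiB per token for an FP8 KV cache on an assumed 70B-class GQA model, and a 50 GB/s link) are illustrative assumptions, and the estimate ignores protocol overhead and any overlap with compute.

```python
def kv_transfer_ms(context_tokens, kv_bytes_per_token, link_gbytes_per_s):
    """Time to move one request's KV cache from a prefill to a decode node."""
    total_bytes = context_tokens * kv_bytes_per_token
    return total_bytes / (link_gbytes_per_s * 1e9) * 1e3


short_ctx_ms = kv_transfer_ms(4096, 163_840, link_gbytes_per_s=50)
long_ctx_ms = kv_transfer_ms(32_768, 163_840, link_gbytes_per_s=50)
```

Under these assumptions a 4,096-token context takes about 13 ms to transfer, while a 32K-token context takes over 100 ms, which is why long-context disaggregation depends heavily on interconnect bandwidth and on overlapping transfer with computation.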

Disaggregation is a production optimization for large-scale deployments (100+ concurrent users). It's not worth the complexity for small or medium deployments.

8. FlashAttention: an operator-level optimization you get for free

FlashAttention (Dao et al., 2022; FlashAttention-2 in 2023; FlashAttention-3 for H100 in 2024) is an IO-aware exact attention implementation that dramatically reduces memory bandwidth consumption during the attention operation.

Standard attention computes softmax(QK^T)V, materializing the full [seq_len × seq_len] attention matrix in HBM (VRAM). For a 4,096-token sequence, that's 4,096² × 2 bytes = 32 MB per attention head per layer — and for multi-head attention with 32 heads over 32 layers, that's 32 GB of intermediate VRAM traffic per forward pass.

FlashAttention tiles this computation in SRAM (on-chip cache), never materializing the full attention matrix in HBM. Result: 5–20× reduction in HBM bandwidth consumption for the attention operation, enabling much longer contexts without an O(n²) memory blow-up.
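The key trick behind tiling is an online softmax: keys and values are processed block by block while a running maximum, running normalizer, and unnormalized output are kept and rescaled as needed. The sketch below shows that idea for a single query in plain Python. It omits the 1/sqrt(d) scaling, masking, and all GPU memory management, so it illustrates the math rather than the actual kernel.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materializes all scores before normalizing."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Process keys/values block by block, keeping only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim   # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # shrink earlier partial results
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
```

Both functions return the same result up to floating-point rounding, but the tiled version never holds more than one block of scores at a time, which is what lets the real kernel keep its working set in on-chip SRAM.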

In practice: FlashAttention is integrated into vLLM, SGLang, TensorRT-LLM, and PyTorch (scaled_dot_product_attention and FlexAttention). You don't implement it; you get it automatically. Its importance is as a prerequisite for 32K+ context inference: without FlashAttention, long-context serving is memory-bandwidth-prohibitive.

Decision framework: which techniques to apply first

Not all techniques stack additively. Apply in this order based on expected ROI:

Priority	Technique	Expected gain	Effort	When to apply
1	FP8 quantization	1.8–2× decode TPS	Low (flag)	Always, first
2	Continuous batching	3–10× throughput	None (default in engines)	Should already be on
3	Prefix caching	50–90% TTFT reduction for repeated prefixes	Low (flag)	If shared system prompts
4	FP8 KV cache	2× context or concurrency	Low (flag)	Long context, high concurrency
5	Speculative decoding	1.5–3× TPS at low batch	Medium (find draft model)	Latency-constrained, batch≤8
6	Chunked prefill	Predictable P99 TTFT	Low (flag)	Mixed prompt lengths
7	Tensor parallelism	Near-linear with GPU count	Medium-high (multi-GPU setup)	When model doesn't fit single GPU
8	Disaggregation	40–60% lower P99 TTFT at scale	High (infrastructure)	Large-scale production only
9	Expert parallelism	Best MoE throughput	High (engine support)	MoE models at scale
10	Pipeline parallelism	Scale across nodes	High	When TP across nodes isn't viable

A practical playbook for a typical production deployment:

Start with FP8 quantization — if your model has a published FP8 checkpoint (Llama 3, Qwen 2.5, Mistral) or auto-quantization is available in your engine, enable it. This is your single highest-ROI action. Validate quality on your eval set.
Enable continuous batching — it's the default in modern engines; confirm it's not disabled.
Enable prefix caching — if you have a system prompt or shared context. One flag in vLLM/SGLang.
Quantize the KV cache to FP8 — enables longer contexts or higher concurrency at same VRAM.
Profile actual workload — run load tests at realistic concurrency. Identify whether you're latency-bound (low batch, bandwidth-limited) or throughput-bound (queued requests, GPU-saturated).
Add speculative decoding — if P50/P99 output token latency is your SLO and batch sizes are low. Find a well-matched draft model; test acceptance rates on your prompt distribution.
Consider parallelism and scaling — when you've exhausted single-GPU optimizations or need to serve a model that doesn't fit on one GPU.

⚠️ Measure on your workload, not synthetic benchmarks

Published throughput numbers are typically measured at high batch sizes (optimal for GPU utilization) with short, uniform prompts. Your production workload likely has variable prompt lengths, mixed batch sizes, and SLO requirements that change the tradeoffs. Benchmark every optimization against your production traffic distribution before committing.

Bottom line

The fastest path to production-grade LLM serving efficiency is: FP8 quantize → enable continuous batching → enable prefix caching → profile → iterate. The first three steps require minimal engineering effort (flags in modern engines) and together yield 3–10× throughput improvement over naïve FP16 static-batched serving.

The advanced techniques (speculative decoding, disaggregation, expert parallelism) have real value but higher marginal complexity. Apply them when you've profiled your workload and have a specific bottleneck they address.

The optimization landscape will keep moving. FP4 quantization (Blackwell), improved EAGLE-class speculative methods, and better prefix cache strategies are all active research areas. The framework for applying them — measure, identify bottleneck, apply the right tool — stays constant.

Key Takeaway

Start every optimization with FP8 quantization and continuous batching — these two techniques, available as flags in any modern serving engine, deliver the largest performance gains with the lowest engineering cost. Add prefix caching if you have shared prompts, speculative decoding if you're latency-bound at low batch sizes, and parallelism only when you've outgrown a single GPU. Always measure on your actual workload: the bottleneck changes with concurrency, and the techniques that win in benchmarks often aren't the bottleneck in your production system.

Frequently asked questions

What is the fastest way to speed up LLM inference?

The highest-leverage wins are usually: (1) use a quantized model (FP8 or INT4) to cut memory traffic, (2) enable continuous batching in your inference engine, and (3) use a faster engine/hardware combination. Speculative decoding helps for low-batch latency. The right lever depends on whether you're latency- or throughput-bound.

Does quantization hurt model quality?

Modern quantization typically loses from under 1% (FP8, INT8 with SmoothQuant) to about 1–3% (INT4 with AWQ/GPTQ) on general quality benchmarks while halving or quartering weight memory. The quality impact depends on the model, method, and task, with reasoning and math degrading more; always evaluate on your own task.

How much faster is speculative decoding?

Speculative decoding can improve effective tokens-per-second by 1.5–3× for low-to-medium batch sizes, depending on the draft model's acceptance rate and overhead. At high batch sizes the benefit shrinks. Try our Speculative Decoding Simulator to estimate your speedup.

Keep learning

Chapter 5: Quantization & Speculative Decoding →
Quantization Estimator →
Speculative Decoding Simulator →
KV Cache Calculator →