LLM Inference Acceleration: Complete Guide
LLM inference is expensive and often slow. The good news: a well-understood toolkit of acceleration techniques can cut latency and cost dramatically. This guide covers every major technique, what it does, when it helps, and how much.
First, know your bottleneck
You can't accelerate effectively without knowing what's limiting you. LLM inference has two phases:
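- Prefill (processing the prompt) is compute-bound: all prompt tokens are processed in parallel, so the GPU's arithmetic throughput is the limit.
- Decode (generating tokens) is memory-bandwidth-bound: each new token requires reading the model weights and the KV cache from memory, with little arithmetic per byte read.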
Most interactive workloads are decode-dominated, which means the techniques that reduce memory traffic (quantization, KV cache optimization) tend to give the biggest wins. Throughput-oriented batch workloads benefit more from batching and parallelism.
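To make the decode bottleneck concrete, here is a rough back-of-the-envelope sketch. It assumes single-stream decode where reading the weights once per token dominates, and it ignores KV cache reads, kernel overhead, and batching. The function name and the example numbers (8B parameters, 1000 GB/s) are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate.

    Every decode step has to stream all weights from memory once, so the
    step time is at least (weight bytes) / (memory bandwidth).
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# Hypothetical setup: 8B parameters on a GPU with 1000 GB/s of bandwidth.
fp16_bound = decode_tokens_per_second(8, 2.0, 1000)  # FP16: 2 bytes/param
int4_bound = decode_tokens_per_second(8, 0.5, 1000)  # INT4: 0.5 bytes/param
print(f"FP16 bound: {fp16_bound:.1f} tok/s, INT4 bound: {int4_bound:.1f} tok/s")
```

Real throughput lands below these bounds, but the ratio between them is why reducing bytes per weight matters so much for decode.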
Are you latency-bound (one user waiting) or throughput-bound (many concurrent users)? The answer changes which technique to reach for first. Optimizing the wrong dimension wastes effort.
1. Quantization: the highest-leverage win
Quantization stores weights (and sometimes activations) in lower precision: FP8, INT8, or INT4 instead of FP16. Because decode is memory-bandwidth-bound, halving the bytes per weight roughly halves the weight traffic per decode step, which speeds up token generation and frees VRAM for larger batches or longer context.
| Format | Memory vs FP16 | Typical quality loss |
|---|---|---|
| FP8 | 0.5× | < 1% |
| INT8 (SmoothQuant) | 0.5× | < 1% |
| INT4 (AWQ / GPTQ) | 0.25× | 1-3% |
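The core idea behind integer formats can be sketched with plain round-to-nearest symmetric quantization. This is a minimal illustration only: it is not SmoothQuant, AWQ, or GPTQ, which add calibration and finer-grained scaling to reduce the error shown here. The helper names and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.03, 1.0]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte plus a shared scale, and the round-trip error is bounded by half a quantization step.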
2. Continuous batching: free throughput
Traditional batching waits for every request in a batch to finish before starting the next. Continuous (in-flight) batching lets new requests join the running batch as soon as slots free up, dramatically improving GPU utilization under concurrent load.
This is table-stakes in modern engines (vLLM, SGLang, TensorRT-LLM all do it) — make sure it's enabled. It's one of the largest throughput wins available and costs you nothing in quality.
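The difference between the two scheduling policies can be illustrated with a toy simulation. It assumes every decode step costs the same regardless of how many slots are occupied and ignores prefill entirely, so treat the step counts as a sketch of the scheduling effect rather than a performance prediction. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step yields one token per active request
        active = [r - 1 for r in active if r > 1]
    return steps


# Output lengths (in tokens) for eight queued requests; two are long.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With static batching the short requests wait behind the long ones in every batch; with continuous batching the second long request starts as soon as a slot opens.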
3. KV cache optimization
The KV cache stores attention keys and values so the model doesn't recompute them for every token. It's essential — but it grows with sequence length and batch size and can dominate VRAM.
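The growth with sequence length and batch size follows from a simple product, sketched below for a standard transformer with one key and one value vector per KV head per layer. The model shape in the example (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption; substitute your own model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    """KV cache size: 2 tensors (K and V) per layer, per token, per sequence."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


GIB = 2 ** 30
# Illustrative shape with an FP16 cache (2 bytes per element).
single = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=1)
batched = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)
fp8 = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16,
                     bytes_per_element=1)
print(single / GIB, batched / GIB, fp8 / GIB)
```

The cache scales linearly in every factor, so doubling context length or batch size doubles the memory, and an 8-bit cache halves it.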
Key techniques:
- Paged KV cache (PagedAttention): allocates the cache in fixed-size blocks, which nearly eliminates memory fragmentation and packs more requests into the same VRAM.
- Prefix caching: reuse the KV cache across requests sharing a prompt prefix (system prompts, few-shot examples, multi-turn chat). Large time-to-first-token (TTFT) win for repeated prefixes.
- KV cache quantization: store the cache in FP8/INT8 to fit longer contexts.
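Prefix caching can be sketched as a lookup keyed by a hash of the token prefix, computed at block granularity. The toy class below only tracks which blocks would be reusable; a real engine maps each key to stored KV tensors and handles eviction. The class name, block size, and token IDs are hypothetical, and this is not any particular engine's implementation.

```python
import hashlib


class PrefixCache:
    """Toy block-level prefix cache keyed by a hash of all tokens up to each block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = set()  # stand-in for a map from prefix hash to KV block

    def _block_keys(self, tokens):
        # One key per full block; each key covers the entire prefix so far,
        # so a block only matches when everything before it matches too.
        keys = []
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            prefix = ",".join(map(str, tokens[:end])).encode()
            keys.append(hashlib.sha256(prefix).hexdigest())
        return keys

    def cached_tokens(self, tokens):
        """Number of leading tokens whose KV blocks are already cached."""
        hits = 0
        for key in self._block_keys(tokens):
            if key not in self.blocks:
                break
            hits += 1
        return hits * self.block_size

    def insert(self, tokens):
        self.blocks.update(self._block_keys(tokens))


cache = PrefixCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]       # shared across requests
cache.insert(system_prompt + [9, 10, 11])       # first request fills the cache
reused = cache.cached_tokens(system_prompt + [42, 43, 44, 45])
print(reused)
```

The second request can skip prefill for the shared system prompt and only compute the tokens that differ, which is where the time-to-first-token saving comes from.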
4. Speculative decoding: latency for low batches
Speculative decoding uses a small, fast draft model to propose several tokens, which the large model then verifies in a single forward pass. When the drafted tokens are accepted, you get multiple tokens for roughly the cost of one large-model decode step plus the (much cheaper) draft passes.
It can improve effective tokens per second (TPS) by 1.5-3× at low-to-medium batch sizes. The benefit shrinks at high batch sizes (where the large model is already well-utilized) and depends on the draft model's acceptance rate.
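How acceptance rate and draft length interact can be estimated with the standard analytical model for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which real workloads only approximate, and it ignores batching effects. The parameter names and the example values (80% acceptance, 4 draft tokens, a draft model costing 5% of a large-model step) are illustrative assumptions.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass of the large model.

    Tokens are accepted until the first rejection; the verification pass
    always contributes one token of its own, so the minimum is 1.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def speculative_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Estimated speedup over plain decoding.

    draft_cost_ratio is the cost of one draft-model step relative to one
    large-model step; each round pays draft_len draft steps plus one
    large-model verification pass.
    """
    tokens = expected_tokens_per_step(acceptance_rate, draft_len)
    return tokens / (draft_len * draft_cost_ratio + 1)


tokens = expected_tokens_per_step(0.8, 4)
speedup = speculative_speedup(0.8, 4, draft_cost_ratio=0.05)
print(f"{tokens:.4f} tokens per step, {speedup:.2f}x estimated speedup")
```

Lower acceptance rates or a more expensive draft model pull the estimate toward 1×, and past some draft length the extra draft cost outweighs the extra accepted tokens.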
5. Model parallelism: when one GPU isn't enough
When a model is too large for a single GPU, split it across several:
- Tensor parallelism (TP): split individual layers across GPUs; lowest latency, but every layer requires cross-GPU communication, so it needs a fast interconnect such as NVLink.
- Pipeline parallelism (PP): split layers into stages across GPUs/nodes; introduces idle "bubbles" while stages wait for work from their neighbors.
- Expert parallelism (EP): shard MoE experts across GPUs; high throughput for mixture-of-experts models.
Use TP within a node (high-bandwidth NVLink), PP across nodes as a last resort, and EP for large MoE models.
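Two quick estimates help when choosing a layout: how much weight memory each GPU holds after sharding, and how much time a pipeline spends idle. The sketch below assumes weights shard evenly and uses the idle fraction of a simple fill-and-drain pipeline schedule; real schedulers and uneven layer splits will differ. The function names and the 70B example are illustrative assumptions.

```python
def weight_gb_per_gpu(params_billion, bytes_per_param, tp_degree, pp_degree=1):
    """Weight memory per GPU, assuming weights shard evenly across TP and PP."""
    return params_billion * bytes_per_param / (tp_degree * pp_degree)


def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a fill-and-drain pipeline schedule.

    With p stages and m microbatches, each stage is busy for m of the
    m + p - 1 time slots needed to push everything through.
    """
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Hypothetical 70B-parameter model in FP16 (2 bytes per parameter).
per_gpu = weight_gb_per_gpu(70, 2, tp_degree=4)
bubble_single = pipeline_bubble_fraction(num_stages=4, num_microbatches=1)
bubble_many = pipeline_bubble_fraction(num_stages=4, num_microbatches=12)
print(per_gpu, bubble_single, bubble_many)
```

The bubble estimate shows why pipeline parallelism suits throughput-oriented serving with many requests in flight: a single request leaves most stages idle most of the time.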
6. Disaggregation: independent scaling
Disaggregation runs prefill and decode on separate hardware pools: prefill on compute-optimized GPUs, decode on bandwidth-optimized ones. This matches each phase to the hardware it needs and lets you scale the two pools independently for cost efficiency. NVIDIA Dynamo provides dynamic disaggregation.
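Independent scaling means the two pools can be sized from different parts of the workload. The sketch below is a deliberately simple capacity estimate: it assumes steady traffic, fixed per-GPU throughput for each phase, and no headroom for bursts or KV cache transfer between pools. The function name and all throughput figures are hypothetical; measure your own hardware before sizing anything.

```python
import math


def size_pools(requests_per_s, prompt_tokens, output_tokens,
               prefill_tokens_per_s_per_gpu, decode_tokens_per_s_per_gpu):
    """Return (prefill_gpus, decode_gpus) needed to keep up with the load."""
    prefill_load = requests_per_s * prompt_tokens  # prompt tokens per second
    decode_load = requests_per_s * output_tokens   # generated tokens per second
    prefill_gpus = math.ceil(prefill_load / prefill_tokens_per_s_per_gpu)
    decode_gpus = math.ceil(decode_load / decode_tokens_per_s_per_gpu)
    return prefill_gpus, decode_gpus


# Hypothetical per-GPU throughput: 10,000 prefill tok/s and 1,500 decode tok/s.
long_prompt_workload = size_pools(10, 4000, 300, 10_000, 1_500)
chat_workload = size_pools(10, 500, 900, 10_000, 1_500)
print(long_prompt_workload, chat_workload)
```

The same request rate yields very different pool ratios depending on prompt and output lengths, which a single shared pool cannot match without over-provisioning one phase.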
Putting it together
A practical acceleration playbook for a typical deployment:
- Quantize the model (FP8 or INT4) — biggest single win.
- Enable continuous batching — if it isn't already.
- Enable prefix caching — especially if you have shared system prompts.
- Add speculative decoding — if you're latency-bound at low batch sizes.
- Scale with parallelism / disaggregation — when you outgrow one GPU.
Key Takeaway
Start with quantization and continuous batching — they give the largest wins for the least effort. Layer in prefix caching, speculative decoding, and parallelism based on whether you're latency- or throughput-bound. Always measure on your own workload.
Frequently asked questions
What is the fastest way to speed up LLM inference?
The highest-leverage wins are usually: (1) use a quantized model (FP8 or INT4) to cut memory traffic, (2) enable continuous batching in your inference engine, and (3) use a faster engine/hardware combination. Speculative decoding helps for low-batch latency. The right lever depends on whether you're latency- or throughput-bound.
Does quantization hurt model quality?
Modern quantization typically costs little: under 1% on quality benchmarks for FP8 and INT8 with SmoothQuant, and 1-3% for INT4 with AWQ/GPTQ, while halving or quartering memory. The quality impact depends on the model and method; always evaluate on your own task.
How much faster is speculative decoding?
Speculative decoding can improve effective tokens per second by 1.5-3× for low-to-medium batch sizes, depending on the draft model's acceptance rate and overhead. At high batch sizes the benefit shrinks.