Recommended Reading
25 papers, tools, and resources referenced throughout Inference Engineering.
NVIDIA Dynamo
NVIDIA · 2025
Open-source distributed serving platform for disaggregation and multi-GPU orchestration.
FlashInfer
Zihao Ye et al. · 2025
Efficient and customizable attention engine for LLM inference serving.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Tri Dao et al. · 2024
Hopper-optimized attention with async data transfer and FP8 support.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li et al. · 2024
Lightweight draft model that reuses the target model's hidden states for high-acceptance-rate speculation.
Medusa: Simple LLM Inference Acceleration Framework
Tianle Cai et al. · 2024
Adding extra decoding heads for multi-token speculation without a separate draft model.
AWQ: Activation-aware Weight Quantization
Ji Lin et al. (MIT) · 2024
Protects the most salient weight channels, identified from activation statistics, to preserve quality at low bit-widths.
SageAttention: Accurate 8-Bit Attention
Jintao Zhang et al. · 2024
Plug-and-play 8-bit attention that accelerates image and video generation.
The Llama 3 Herd of Models
Grattafiori et al. · 2024
Meta's Llama 3 paper with insights on GPU reliability and large-scale training.
DeepSeek-V3 Technical Report
DeepSeek AI · 2024
671B-parameter MoE model introducing Multi-head Latent Attention and large-scale FP8 training and inference.
SWE-Bench
Carlos Jimenez et al. · 2024
Benchmark for evaluating LLMs on real-world GitHub issues.
Efficient Memory Management for LLM Serving with PagedAttention
Woosuk Kwon et al. · 2023
The vLLM paper introducing PagedAttention for efficient KV cache management.
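The core idea of PagedAttention, borrowed from OS virtual memory, can be sketched with a toy block table; the names here (BLOCK_SIZE, BlockTable) are illustrative, not vLLM's actual API:

```python
# Toy sketch of the block-table idea behind PagedAttention
# (illustrative names, not vLLM's API).
BLOCK_SIZE = 16  # tokens per physical KV-cache block

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Logical position -> (physical block id, offset), like a page table.
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(1024))      # 1024 physical blocks in the shared cache
seq = BlockTable(pool)
for _ in range(40):           # decode 40 tokens
    seq.append_token()
print(len(seq.blocks))        # 3 blocks cover 40 tokens at 16 per block
print(seq.physical_slot(39))  # last token sits at offset 7 of the third block
```

Because blocks are allocated on demand from a shared pool, sequences of wildly different lengths share the cache without the per-request over-reservation that contiguous allocation requires.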
Ring Attention with Blockwise Transformers for Near-Infinite Context
Hao Liu et al. · 2023
Context parallelism mechanism for near-infinite context with distributed attention.
vLLM
UC Berkeley / Linux Foundation · 2023
The most widely-used open-source inference engine with PagedAttention.
SGLang
LMSYS · 2023
Fast inference engine with flexible frontend/backends and strong MoE support.
TensorRT-LLM
NVIDIA · 2023
NVIDIA's high-performance inference engine with deep hardware integration.
FlashAttention: Fast and Memory-Efficient Exact Attention
Tri Dao et al. · 2022
IO-aware attention algorithm that minimizes memory traffic for faster inference.
Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan et al. · 2022
The original speculative decoding paper for generating multiple tokens per pass.
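The paper's accept/reject rule is simple enough to sketch in a few lines; draft_prob and target_prob below are stand-ins for real model probabilities, and the residual-distribution resampling step on rejection is omitted:

```python
import random

# Toy sketch of the speculative-decoding accept/reject rule.
# draft_prob and target_prob are stand-ins for real model probabilities.
def draft_prob(token):
    return 0.5  # the cheap draft model's probability for `token`

def target_prob(token):
    return 0.8 if token == "a" else 0.2  # the large target model's probability

def speculate(draft_tokens):
    """Accept each drafted token with probability min(1, p_target / p_draft);
    stop at the first rejection (the paper then resamples from the residual
    distribution, omitted here for brevity)."""
    accepted = []
    for t in draft_tokens:
        if random.random() < min(1.0, target_prob(t) / draft_prob(t)):
            accepted.append(t)
        else:
            break
    return accepted

random.seed(0)
print(speculate(["a", "b", "c"]))  # ['a']: "b" is rejected on this seed
```

The guarantee that makes this attractive is that the accept/reject-plus-resample scheme preserves the target model's output distribution exactly, so speculation buys latency without any quality trade-off.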
GPTQ: Accurate Post-Training Quantization
Elias Frantar et al. · 2022
One-shot post-training quantization for generative pre-trained transformers.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers et al. · 2022
Pioneering work on INT8 inference for large transformers.
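The basic absmax INT8 round trip underlying this line of work can be sketched as follows; the paper's key addition, keeping outlier feature dimensions in FP16, is omitted here:

```python
import numpy as np

# Toy absmax INT8 quantization in the spirit of LLM.int8() (illustrative only;
# the paper additionally keeps outlier feature dimensions in FP16).
def quantize_int8(x):
    scale = np.abs(x).max() / 127.0           # map the largest value to +/-127
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.1, -1.27, 0.64, 0.0], dtype=np.float32)
q, s = quantize_int8(x)
print(q.tolist())                              # [10, -127, 64, 0]
print(np.abs(dequantize(q, s) - x).max() < s)  # True: error under one step
```

A single per-tensor scale like this is exactly what breaks down when a few outlier activations dominate the range, which is the failure mode the paper's mixed-precision decomposition addresses.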
MTEB: Massive Text Embedding Benchmark
Niklas Muennighoff et al. · 2022
Comprehensive benchmark for text embedding models across diverse tasks.
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach et al. · 2021
The Stable Diffusion paper introducing latent space diffusion for image generation.
MMLU
Dan Hendrycks et al. · 2021
Massive Multitask Language Understanding benchmark for measuring model knowledge.
Megatron-LM: Training Multi-Billion Parameter Language Models
Mohammad Shoeybi et al. · 2019
Model parallelism techniques for training and serving massive language models.
Attention Is All You Need
Vaswani et al. · 2017
The seminal paper introducing the transformer architecture.
CUTLASS
NVIDIA · 2017
CUDA C++ template library for high-performance GEMM kernels.