Recommended Reading

25 papers, tools, and resources referenced throughout Inference Engineering.

NVIDIA Dynamo

NVIDIA · 2025

Open-source distributed serving platform for prefill/decode disaggregation and multi-GPU orchestration.

Developer Tools

FlashInfer

Ye et al. · 2025

Efficient and customizable attention engine for LLM inference serving.

Developer Tools

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah et al. · 2024

Hopper-optimized attention with async data transfer and FP8 support.

Inference Optimization

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li et al. · 2024

Speculative decoding with a lightweight draft head that extrapolates the target model's hidden states, yielding high acceptance rates.

Inference Optimization

Medusa: Simple LLM Inference Acceleration Framework

Tianle Cai et al. · 2024

Adds extra decoding heads that propose several future tokens per step, avoiding a separate draft model.

Inference Optimization
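
A minimal sketch of the Medusa idea, assuming a toy model whose final hidden state feeds a few extra linear heads, each guessing one additional future token; all sizes and arrays here are made up for illustration, and the real framework adds tree-structured verification against the base model.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, NUM_HEADS = 64, 1000, 3          # toy sizes, purely illustrative

# Base LM head plus extra "Medusa" heads; head k guesses the token k+1 steps ahead.
lm_head = rng.normal(size=(HIDDEN, VOCAB))
medusa_heads = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(NUM_HEADS)]

def draft_tokens(last_hidden):
    """Propose 1 + NUM_HEADS candidate tokens from a single forward pass's hidden state."""
    proposal = [int(np.argmax(last_hidden @ lm_head))]       # the normal next token
    for head in medusa_heads:                                # cheap guesses for t+2, t+3, ...
        proposal.append(int(np.argmax(last_hidden @ head)))
    return proposal

print(draft_tokens(rng.normal(size=HIDDEN)))    # candidates are then verified by the base model
```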

AWQ: Activation-aware Weight Quantization

Ji Lin et al. (MIT) · 2024

Weight-only quantization that protects the most salient channels, identified from activation magnitudes, to preserve quality during compression.

Inference Optimization
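
A much-simplified sketch of the activation-aware intuition, with synthetic weights and made-up calibration statistics; real AWQ searches per-channel scales on calibration data and uses group-wise INT4 quantization. On this toy, scaling the salient channels typically lowers the output error of naive round-to-nearest.

```python
import numpy as np

rng = np.random.default_rng(1)
IN, OUT = 256, 128
W = rng.normal(size=(OUT, IN))
act_scale = np.full(IN, 0.1)                         # stand-in for calibration statistics:
act_scale[rng.choice(IN, 8, replace=False)] = 8.0    # a few input channels carry large activations

def quantize_int4(w):
    """Naive symmetric 4-bit round-to-nearest with one scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.round(w / scale) * scale

# Activation-aware trick: scale salient input channels up before quantizing and back down
# afterwards, so rounding error is pushed onto channels whose activations are small.
s = act_scale ** 0.5
W_rtn = quantize_int4(W)
W_awq = quantize_int4(W * s) / s

x = rng.normal(size=(64, IN)) * act_scale            # activations with matching channel scales
err = lambda Wq: np.abs(x @ W.T - x @ Wq.T).mean()
print(f"round-to-nearest error: {err(W_rtn):.4f}   activation-aware: {err(W_awq):.4f}")
```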

SageAttention: Accurate 8-Bit Attention

Jintao Zhang et al. · 2024

Plug-and-play 8-bit attention for accelerating image and video generation.

Inference Optimization

The Llama 3 Herd of Models

Grattafiori et al. · 2024

Meta's Llama 3 paper with insights on GPU reliability and large-scale training.

Frontier Models

DeepSeek-V3 Technical Report

DeepSeek AI · 2024

671B-parameter MoE model with Multi-head Latent Attention and FP8 GEMMs (DeepGEMM) for efficient inference.

Frontier Models
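
A greatly simplified sketch of the Multi-head Latent Attention idea, ignoring the decoupled RoPE path and using made-up dimensions: only a small per-token latent is cached, and per-head keys and values are reconstructed from it at attention time.

```python
import numpy as np

rng = np.random.default_rng(0)
D, HEADS, HEAD_DIM, LATENT = 1024, 8, 128, 64        # toy sizes, not DeepSeek's real dimensions

W_dkv = rng.normal(size=(D, LATENT)) / np.sqrt(D)                       # shared down-projection
W_uk = rng.normal(size=(HEADS, LATENT, HEAD_DIM)) / np.sqrt(LATENT)     # per-head key up-projection
W_uv = rng.normal(size=(HEADS, LATENT, HEAD_DIM)) / np.sqrt(LATENT)     # per-head value up-projection

hidden = rng.normal(size=(40, D))                    # hidden states for 40 cached tokens
kv_latent = hidden @ W_dkv                           # only this (40, LATENT) array is cached

# At attention time, per-head keys and values are reconstructed from the latent on the fly.
K = np.einsum("tl,hld->htd", kv_latent, W_uk)        # (HEADS, 40, HEAD_DIM)
V = np.einsum("tl,hld->htd", kv_latent, W_uv)

print(f"floats cached per token: {LATENT} vs {HEADS * 2 * HEAD_DIM} for standard multi-head KV")
```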

SWE-bench

Carlos Jimenez et al. · 2024

Benchmark for evaluating LLMs on real-world GitHub issues.

Evaluation

Efficient Memory Management for LLM Serving with PagedAttention

Woosuk Kwon et al. · 2023

The vLLM paper introducing PagedAttention for efficient KV cache management.

Inference Optimization
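
A toy sketch of the block-table idea behind PagedAttention, with made-up sizes: KV entries live in fixed-size physical blocks allocated on demand, and a per-sequence table maps logical positions to physical blocks, so memory is neither pre-reserved nor contiguous.

```python
import numpy as np

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 16, 64, 8         # toy sizes

# One physical pool of KV blocks shared by all sequences (keys only, for brevity).
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    """A logical token stream backed by non-contiguous physical blocks."""
    def __init__(self):
        self.block_table = []                        # logical block index -> physical block index
        self.length = 0

    def append_kv(self, vec):
        if self.length % BLOCK_SIZE == 0:            # current block is full (or this is token 0)
            self.block_table.append(free_blocks.pop())   # allocate a physical block on demand
        phys = self.block_table[self.length // BLOCK_SIZE]
        kv_pool[phys, self.length % BLOCK_SIZE] = vec
        self.length += 1

    def gather(self):
        """Materialize this sequence's KV entries in logical order for attention."""
        return np.concatenate([kv_pool[b] for b in self.block_table])[: self.length]

seq = Sequence()
for t in range(40):                                  # 40 tokens -> exactly 3 blocks of 16
    seq.append_kv(np.full(HEAD_DIM, t, dtype=np.float32))
print(seq.block_table, seq.gather().shape)           # e.g. [63, 62, 61] (40, 8)
```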

Ring Attention with Blockwise Transformers

Hao Liu et al. · 2023

Context parallelism that distributes blockwise attention across devices, enabling near-infinite context lengths.

Inference Optimization

vLLM

UC Berkeley / Linux Foundation · 2023

The most widely-used open-source inference engine with PagedAttention.

Developer Tools
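
If vLLM is installed (it needs a GPU), offline generation looks roughly like this; the model id is only an example.

```python
from vllm import LLM, SamplingParams

# Loads the model and pre-allocates the paged KV cache on the available GPU(s).
llm = LLM(model="facebook/opt-125m")                  # example model; any HF causal LM id works
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```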

SGLang

LMSYS · 2023

Fast inference engine with flexible frontend/backends and strong MoE support.

Developer Tools

TensorRT-LLM

NVIDIA · 2023

NVIDIA's high-performance inference engine with deep hardware integration.

Developer Tools

FlashAttention: Fast and Memory-Efficient Exact Attention

Tri Dao et al. · 2022

IO-aware attention algorithm that minimizes memory traffic for faster inference.

Inference Optimization
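
The core trick in a toy single-query NumPy sketch: stream over K/V tiles while maintaining running softmax statistics, so the full score matrix is never materialized; the actual kernel fuses this tiling into on-chip SRAM.

```python
import numpy as np

def attention_tiled(q, K, V, tile=64):
    """Exact softmax(q K^T / sqrt(d)) V for one query, streaming over KV tiles."""
    d = q.shape[-1]
    m = -np.inf                                      # running max of scores (numerical stability)
    denom = 0.0                                      # running softmax denominator
    acc = np.zeros(V.shape[-1])                      # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)                  # correct old statistics to the new max
        p = np.exp(s - m_new)
        denom = denom * rescale + p.sum()
        acc = acc * rescale + p @ V[start:start + tile]
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(512, 64)), rng.normal(size=(512, 64)), rng.normal(size=64)
s = K @ q / np.sqrt(64)
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(attention_tiled(q, K, V), reference))   # True: tiling is exact, not approximate
```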

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan et al. · 2022

The original speculative decoding paper: a cheap draft model proposes tokens that the target model verifies, yielding several tokens per target forward pass.

Inference Optimization
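
A minimal sketch of the draft-and-verify loop, using toy next-token distributions in place of real models; the accept/resample rule below preserves the target distribution exactly, and in the real method all drafted positions are verified in one target-model forward pass rather than the sequential calls shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, K = 50, 4                      # toy vocabulary size and number of drafted tokens

def toy_dist(ctx, temperature):
    """Stand-in for a model's next-token distribution given a context (a list of token ids)."""
    logits = np.sin(np.arange(VOCAB) * (1 + len(ctx))) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

draft_model = lambda ctx: toy_dist(ctx, 1.5)    # cheap model, slightly different distribution
target_model = lambda ctx: toy_dist(ctx, 1.0)   # expensive model whose outputs we must match

def speculative_step(ctx):
    # 1) Draft K tokens autoregressively with the cheap model.
    drafted, q_dists = [], []
    for _ in range(K):
        q = draft_model(ctx + drafted)
        drafted.append(int(rng.choice(VOCAB, p=q)))
        q_dists.append(q)
    # 2) Verify: accept drafted token x with probability min(1, p(x) / q(x)).
    accepted = []
    for tok, q in zip(drafted, q_dists):
        p = target_model(ctx + accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the normalized residual max(p - q, 0) and stop.
            residual = np.maximum(p - q, 0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted
    # 3) Every draft accepted: take one bonus token from the target model.
    accepted.append(int(rng.choice(VOCAB, p=target_model(ctx + accepted))))
    return accepted

print(speculative_step([1, 2, 3]))    # up to K + 1 new tokens per target-model "pass"
```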

GPTQ: Accurate Post-Training Quantization

Elias Frantar et al. · 2022

One-shot post-training quantization for generative pre-trained transformers.

Inference Optimization

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers et al. · 2022

Pioneering INT8 inference for large transformers, combining vector-wise quantization with a mixed-precision decomposition for outlier features.

Inference Optimization
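
A toy NumPy sketch (not the bitsandbytes implementation) of the mixed-precision decomposition: activation columns containing outliers stay in full precision, while the rest of the matmul runs in int8 with per-row and per-column scales.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 512))
X[:, [7, 130]] *= 40.0                               # inject a couple of outlier feature columns
W = rng.normal(size=(512, 256))

def quantize_int8(A, axis):
    """Symmetric int8 round-to-nearest with one scale per row (axis=1) or column (axis=0)."""
    scale = np.abs(A).max(axis=axis, keepdims=True) / 127.0
    return np.round(A / scale).astype(np.int8), scale

def llm_int8_matmul(X, W, threshold=6.0):
    outlier = np.abs(X).max(axis=0) > threshold      # feature columns with extreme activations
    # Int8 path for the regular columns: per-row scales for X, per-column scales for W.
    Xq, sx = quantize_int8(X[:, ~outlier], axis=1)
    Wq, sw = quantize_int8(W[~outlier], axis=0)
    y = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw
    # Higher-precision path for the few outlier columns, then combine.
    return y + X[:, outlier] @ W[outlier]

exact = X @ W
approx = llm_int8_matmul(X, W)
print("mean relative error:", np.abs(exact - approx).mean() / np.abs(exact).mean())
```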

MTEB: Massive Text Embedding Benchmark

Niklas Muennighoff et al. · 2022

Comprehensive benchmark for text embedding models across diverse tasks.

Evaluation

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach et al. · 2021

The Stable Diffusion paper introducing latent space diffusion for image generation.

Architecture

MMLU

Dan Hendrycks et al. · 2021

Massive Multitask Language Understanding benchmark for measuring model knowledge.

Evaluation

Megatron-LM: Training Multi-Billion Parameter Language Models

Mohammad Shoeybi et al. · 2019

Tensor (model) parallelism techniques for splitting transformer layers across GPUs when training and serving massive language models.

Architecture
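
A toy NumPy illustration of the tensor-parallel MLP split: the first linear layer is partitioned by columns and the second by rows, so each "rank" (here just an array slice) works independently and a single all-reduce recovers the exact result.

```python
import numpy as np

rng = np.random.default_rng(0)
D, FFN, TP = 64, 256, 4                              # hidden size, FFN size, tensor-parallel degree
x = rng.normal(size=(8, D))                          # a batch of activations
W1, W2 = rng.normal(size=(D, FFN)), rng.normal(size=(FFN, D))

W1_shards = np.split(W1, TP, axis=1)                 # column-parallel first linear layer
W2_shards = np.split(W2, TP, axis=0)                 # row-parallel second linear layer

# Each "rank" computes its partial MLP output independently...
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
# ...and one all-reduce (here just a sum) recovers the full result.
y_parallel = sum(partials)

y_reference = np.maximum(x @ W1, 0) @ W2
print(np.allclose(y_parallel, y_reference))          # True: the split is mathematically exact
```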

Attention Is All You Need

Vaswani et al. · 2017

The seminal paper introducing the transformer architecture.

Architecture
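
For reference, the scaled dot-product attention at the heart of the paper, in a few lines of NumPy (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core operation of the transformer."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                       # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)           # (10, 64)
```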

CUTLASS

NVIDIA · 2017

CUDA C++ template library for high-performance GEMM kernels.

Developer Tools