TensorRT-LLM
NVIDIA's optimised inference library for LLMs, generating highly tuned CUDA kernels via TensorRT with support for FP8, AWQ, and multi-GPU serving.
Definition
TensorRT-LLM is NVIDIA's production-grade LLM inference library. It uses NVIDIA's TensorRT compiler to build optimised engines whose CUDA kernels are tailored to a specific model architecture, precision, and range of sequence lengths. Key features include FP8 inference on GPUs that support it (such as H100), in-flight batching, speculative decoding, LoRA serving, and first-class multi-GPU tensor and pipeline parallelism. It is often the fastest option on NVIDIA hardware when the model is supported, though it requires an engine-build (compilation) step and is less flexible than Python-first frameworks like vLLM.
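To make the precision and multi-GPU points concrete, here is a rough back-of-the-envelope sketch of weight memory per GPU. It is not part of TensorRT-LLM: the function name `weight_gib_per_gpu` and the 70B-parameter model are hypothetical, and the estimate assumes weights split evenly across tensor-parallel GPUs while ignoring quantisation scales, the KV cache, and activations.

```python
# Approximate bytes needed to store one weight at each precision.
# Assumption: quantisation metadata (scales, zero points) is ignored.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4_awq": 0.5}


def weight_gib_per_gpu(num_params: float, precision: str, tensor_parallel: int = 1) -> float:
    """Estimate weight memory per GPU in GiB.

    Assumes tensor parallelism splits the weights evenly across GPUs,
    which is only approximately true in practice.
    """
    total_bytes = num_params * BYTES_PER_PARAM[precision]
    return total_bytes / tensor_parallel / 2**30


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    for precision in BYTES_PER_PARAM:
        for tp in (1, 2, 4):
            gib = weight_gib_per_gpu(params, precision, tp)
            print(f"{precision:>9} tp={tp}: {gib:6.1f} GiB per GPU")
```

Halving the bytes per weight (FP16 to FP8) or doubling the tensor-parallel degree each halve the per-GPU weight footprint, which is why the two are commonly combined for large models.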
Related
More Software terms
PagedAttention
vLLM's technique for storing the KV cache in fixed-size, non-contiguous memory blocks (pages), largely eliminating fragmentation and enabling larger effective batch sizes.
FlashAttention
IO-aware exact attention algorithm that tiles the computation so each block stays in on-chip SRAM, cutting HBM reads/writes and speeding up attention by a reported 2–4× over standard implementations.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
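The difference between static and continuous batching can be illustrated with a toy simulation. This is a minimal sketch, not the scheduler of TensorRT-LLM or vLLM: the function names are hypothetical, every active sequence is assumed to produce exactly one token per decode step, every request needs at least one token, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short sequences in the batch sit idle until the longest one is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    waiting = deque(lengths)  # tokens still to generate for each queued request
    running = []              # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active sequence emits one token; finished sequences leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    # Hypothetical workload: output lengths in tokens, in arrival order.
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]
    print("static:    ", static_batching_steps(lengths, batch_size=4))      # 16
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))  # 10
```

In this workload the static scheduler spends most of each batch waiting on a single long sequence, while the continuous scheduler starts the second long request as soon as a short one frees its slot.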