
TensorRT-LLM

NVIDIA's optimised inference library for LLMs, generating highly tuned CUDA kernels via TensorRT with support for FP8, AWQ, and multi-GPU serving.

Definition

TensorRT-LLM is NVIDIA's production-grade LLM inference library. It uses NVIDIA's TensorRT compiler to build an optimised engine whose CUDA kernels are tailored to a specific model architecture, precision, and sequence-length range. Key features include FP8 inference on H100, in-flight batching (new requests join a running batch as soon as earlier ones finish), speculative decoding, LoRA serving, and first-class multi-GPU tensor and pipeline parallelism. For supported models it is typically among the fastest options on NVIDIA hardware, though it requires a compilation step and is less flexible than Python-first frameworks like vLLM.
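
To make the in-flight batching idea concrete, here is a minimal toy sketch in plain Python. It is not TensorRT-LLM's scheduler or API: the function names, the fixed slot count, and the assumption that every request emits exactly one token per decode step are all illustrative. It only compares how many decode steps a queue of requests needs when a batch must wait for its longest member versus when a freed slot is refilled immediately.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away.

    Assumes every length is a positive number of output tokens.
    """
    waiting = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while waiting or active:
        # Fill any free slots from the queue before the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: output lengths (in tokens) of six queued requests.
lengths = [2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))    # 24
print(inflight_batching_steps(lengths, batch_size=2))  # 18
```

In this toy workload the short requests no longer hold a slot idle while a long neighbour finishes, which is the effect in-flight batching aims for. Real schedulers also account for prefill cost, memory limits, and request arrival times, none of which are modelled here.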
