Software

TensorRT-LLM

NVIDIA's optimised inference library for LLMs, generating highly tuned CUDA kernels via TensorRT with support for FP8, AWQ, and multi-GPU serving.

Definition

TensorRT-LLM is NVIDIA's production-grade LLM inference library. It uses NVIDIA's TensorRT compiler to build optimised engines whose CUDA kernels are tailored to a specific model architecture, precision, and sequence-length range. Key features include FP8 inference on H100-class GPUs, in-flight batching (NVIDIA's term for continuous batching), speculative decoding, LoRA serving, and first-class multi-GPU tensor and pipeline parallelism. For supported models it is typically among the fastest options on NVIDIA hardware, but it requires an engine-building (compilation) step and is less flexible than Python-first frameworks such as vLLM.

Related terms: vLLM, FP8, Tensor Core

More Software terms

PagedAttention

vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.
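The paging idea can be illustrated with a toy block table. This is a minimal sketch, not vLLM's implementation: the class `PagedKVCache`, its methods, and the page sizes are hypothetical names chosen for illustration, and only the bookkeeping is modelled (no tensors are stored).

```python
class PagedKVCache:
    """Toy allocator: each sequence's tokens map to fixed-size physical pages."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # pool of physical page ids
        self.block_tables = {}                    # seq_id -> list of physical page ids
        self.lengths = {}                         # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> tuple:
        """Reserve a slot for one more token; returns (physical_page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:          # last page is full, or no page yet
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())   # any free page will do
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id: str) -> None:
        """Return every page of a finished sequence to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only ever holds whole pages drawn from a shared pool, its pages need not be adjacent, and the only wasted space is the unused tail of its last page.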

FlashAttention

IO-aware exact attention algorithm that tiles the computation so intermediate results stay in on-chip SRAM, cutting HBM reads/writes and speeding up attention by roughly 2–4× over a standard implementation.
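The reason tiling can stay exact is the online-softmax rescaling trick, sketched below for a single query vector in plain Python. This is an illustration under simplifying assumptions, not the FlashAttention kernel: there is no GPU memory hierarchy here, the usual 1/sqrt(d) score scaling is omitted, and the function names and the `tile` parameter are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialise all scores, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values one tile at a time, keeping only running statistics."""
    m = -math.inf                    # running maximum score
    z = 0.0                          # running softmax denominator
    acc = [0.0] * len(values[0])     # running unnormalised output
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale earlier partial sums (0.0 on first tile)
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

The tiled version never holds more than one tile of scores at a time, yet it matches the reference up to floating-point rounding; on a GPU, that is what lets the working set fit in SRAM.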

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
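A toy scheduler makes the difference from static batching concrete. This is a sketch under simplifying assumptions: each request needs a known number of decode steps, every step produces one token per running sequence, and prefill cost is ignored. The function name and its parameters are hypothetical.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """requests: list of (name, tokens_to_generate).

    Returns {name: decode step at which that request finished}.
    """
    waiting = deque(requests)
    running = {}       # name -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Refill free slots before every decode step, not only between batches.
        while waiting and len(running) < max_batch:
            name, remaining = waiting.popleft()
            running[name] = remaining
        step += 1
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                del running[name]      # the slot frees up immediately
                finished_at[name] = step
    return finished_at
```

With requests `[("a", 1), ("b", 4), ("c", 2)]` and `max_batch=2`, request `c` takes over the slot `a` frees after step 1 and finishes at step 3. Under static batching in this same toy model, `c` would wait for `b` to finish at step 4 and complete only at step 6.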
