Text Generation Inference (TGI)
Hugging Face's production LLM serving toolkit with continuous batching, tensor parallelism, and a Rust-based HTTP server.
Definition
Text Generation Inference (TGI) is Hugging Face's inference server for large language models. Built with a Rust HTTP server front-end and Python/C++ backend, TGI supports continuous batching, tensor parallelism, flash attention, and quantization (GPTQ, AWQ, bitsandbytes). It integrates tightly with the Hugging Face Hub for model downloading and authentication, and exposes the Messages API (OpenAI-compatible) by default. TGI is widely used in Hugging Face Inference Endpoints and is available as a Docker image for self-hosting.
Related
More Software terms
PagedAttention
vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.
FlashAttention
IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.