Software

Text Generation Inference (TGI)

Hugging Face's production LLM serving toolkit with continuous batching, tensor parallelism, and a Rust-based HTTP server.

Definition

Text Generation Inference (TGI) is Hugging Face's inference server for large language models. Built with a Rust HTTP server front-end and Python/C++ backend, TGI supports continuous batching, tensor parallelism, flash attention, and quantization (GPTQ, AWQ, bitsandbytes). It integrates tightly with the Hugging Face Hub for model downloading and authentication, and exposes the Messages API (OpenAI-compatible) by default. TGI is widely used in Hugging Face Inference Endpoints and is available as a Docker image for self-hosting.

vLLM Continuous Batching

More Software terms

PagedAttention

vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.

FlashAttention

IO-aware exact attention algorithm that tiles computation to stay in SRAM, cutting HBM reads/writes and speeding up attention by 2–4×.

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.

Back to Glossary Start Reading — Chapter 0