Text Generation Inference (TGI) is the serving engine behind Hugging Face's own Inference API and Inference Endpoints, handling millions of requests daily across the Hugging Face ecosystem. Its router and serving core are written in Rust for performance and safety, and it implements FlashAttention for memory-efficient attention computation, continuous batching that dynamically groups incoming requests to maximize GPU utilization, and tensor parallelism for distributing large models across multiple GPUs. With over 10,000 GitHub stars, TGI has become a proven choice for production LLM serving.
TGI supports a wide range of quantization methods, including GPTQ, AWQ, EETQ, and bitsandbytes, for reducing a model's memory footprint without significant quality loss. It natively handles the safetensors format for secure model loading, provides structured output generation via grammars, and offers watermarking capabilities. The server exposes an OpenAI-compatible HTTP API for easy integration with existing applications, and uses gRPC internally for high-performance communication between its router and model shards.
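Because the API is OpenAI-compatible, talking to TGI means sending the standard chat-completions payload to the local server. A minimal sketch using only the standard library, assuming a TGI instance listening on localhost:8080 (the host, port, and prompt are illustrative; the endpoint path and placeholder model name follow the OpenAI convention):

```python
import json
from urllib.request import Request, urlopen

# Assumed local TGI instance; adjust host and port to your deployment.
TGI_URL = "http://localhost:8080/v1/chat/completions"

# Standard OpenAI-style chat-completions payload. TGI serves a single
# model per container, so the "model" field is effectively a placeholder.
payload = {
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "Summarize continuous batching in one sentence."}
    ],
    "max_tokens": 128,
    "stream": False,
}

request = Request(
    TGI_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to run against a live server:
# with urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the request shape matches OpenAI's, existing clients can usually be pointed at TGI just by overriding the base URL.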
Deployment is Docker-first with pre-built images that include all necessary CUDA libraries and dependencies. A single docker run command with the model ID is enough to start serving any supported model from the Hugging Face Hub. TGI supports model architectures including LLaMA, Mistral, Mixtral, Falcon, StarCoder, GPT-NeoX, BLOOM, and many more. For organizations already invested in the Hugging Face ecosystem, TGI provides the natural serving layer that maintains compatibility with the Hub's model management and versioning capabilities.
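The Docker-first workflow above comes down to one command. A sketch, assuming a CUDA-capable host; the image tag, port mapping, and model ID (mistralai/Mistral-7B-Instruct-v0.2 here) are illustrative placeholders:

```shell
# Pull and serve a Hub model with TGI; the server listens on port 80
# inside the container, mapped here to 8080 on the host.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-Instruct-v0.2
```

The -v mount caches downloaded weights under ./data so restarts skip the download, and --shm-size 1g gives sharded model processes enough shared memory for inter-GPU communication.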