TensorRT-LLM is NVIDIA's purpose-built library for squeezing maximum inference performance out of large language models on NVIDIA GPUs. It takes models from frameworks like PyTorch and Hugging Face Transformers and compiles them into highly optimized TensorRT engines with kernel fusion, mixed-precision execution, and advanced memory management. The library supports FP8 inference on H100 and Blackwell GPUs for significant throughput improvements, along with INT4 and INT8 quantization for reducing memory footprint without severe quality loss.
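The memory savings behind INT8 quantization can be illustrated with a minimal, self-contained sketch. This is plain Python showing the general idea of symmetric per-tensor quantization, not TensorRT-LLM's actual implementation or API: each weight is stored as one signed byte plus a single shared FP32 scale, and the round-trip error stays bounded by half the quantization step.

```python
# Toy symmetric per-tensor INT8 weight quantization.
# Conceptual sketch only -- NOT the TensorRT-LLM implementation.

def quantize_int8(weights):
    """Map float weights to int8 codes in [-127, 127] plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float weights from the int8 codes."""
    return [c * scale for c in codes]

weights = [0.82, -1.27, 0.003, 0.5, -0.91]
codes, scale = quantize_int8(weights)
recovered = dequantize_int8(codes, scale)

# Round-trip error is bounded by half the quantization step (scale / 2),
# which is why INT8 preserves quality well for most weight distributions.
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
assert max_err <= scale / 2 + 1e-9
```

At one byte per weight (plus a negligible scale), this halves the footprint of FP16 weights and quarters that of FP32; INT4 variants push the same idea further at the cost of a coarser quantization step.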
For production-scale deployment, TensorRT-LLM provides tensor parallelism and pipeline parallelism to distribute models across multiple GPUs and nodes. Its in-flight batching system dynamically groups inference requests for maximum GPU utilization, while KV cache management with paged attention reduces memory waste. The library works with a wide range of model architectures including LLaMA, GPT, Mistral, Mixtral, Falcon, Qwen, Baichuan, and many others, with pre-built optimization profiles for common configurations.
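How paged KV cache management avoids the waste of contiguous preallocation can be sketched in a few lines of plain Python. This is an illustrative toy of the idea behind paged attention, not TensorRT-LLM's actual allocator: sequences draw fixed-size blocks on demand from a shared pool, so memory use tracks the tokens actually generated rather than a worst-case maximum sequence length, and finished requests return their blocks immediately.

```python
# Toy paged KV-cache allocator: fixed-size blocks drawn from a shared pool.
# Illustrative sketch of the paged-attention idea, NOT TensorRT-LLM's code.

BLOCK_TOKENS = 16  # tokens per cache block (hypothetical size)

class PagedKVCache:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # pool of free block ids
        self.tables = {}                       # seq_id -> list of block ids
        self.lengths = {}                      # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_TOKENS == 0:              # current block full, or none yet
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(total_blocks=8)
for _ in range(20):          # sequence 0 generates 20 tokens -> 2 blocks
    cache.append_token(0)
for _ in range(5):           # sequence 1 generates 5 tokens -> 1 block
    cache.append_token(1)
assert len(cache.tables[0]) == 2 and len(cache.tables[1]) == 1

cache.release(0)             # finished sequence frees its blocks at once
assert len(cache.free) == 7
```

Contiguous preallocation at a hypothetical 128-token maximum would have pinned 8 blocks per sequence regardless of actual length; here the two sequences together touch only 3 blocks, which is the headroom in-flight batching exploits to pack more concurrent requests onto the same GPU.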
TensorRT-LLM is open source under the Apache 2.0 license and integrates natively with NVIDIA Triton Inference Server for serving, as well as with NVIDIA NIM for containerized deployment. It runs only on NVIDIA GPU hardware, but for organizations running LLMs at scale its state-of-the-art inference throughput generally justifies that constraint. The library receives regular updates aligned with new GPU architectures and with model releases from the open-source community.