llm-d addresses the operational complexity of running large language model inference at scale on Kubernetes. While individual serving engines such as vLLM handle the mechanics of running models on GPUs, production deployments need an orchestration layer that manages routing, scheduling, scaling, and resource allocation across a fleet of GPU nodes. llm-d provides this orchestration through a Kubernetes-native architecture: custom resources and operators declare inference topologies, and an intelligent router weighs KV cache state, GPU memory availability, and request characteristics when assigning work to nodes.
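The routing decision described above can be pictured as a scoring function over candidate nodes. This is a minimal sketch, not llm-d's actual API: the field names, weights, and scoring formula are all hypothetical, chosen only to illustrate how cache state, free memory, and request size might be combined.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpu_mem_gb: float      # free GPU memory reported by the node (illustrative field)
    cached_prefix_tokens: int   # tokens of this request's prefix already warm in KV cache

def score_node(node: Node, prompt_tokens: int, mem_needed_gb: float) -> float:
    """Score one candidate: prefer warm KV cache, require enough free memory."""
    if node.free_gpu_mem_gb < mem_needed_gb:
        return float("-inf")  # node cannot host the request at all
    cache_hit = min(node.cached_prefix_tokens, prompt_tokens) / prompt_tokens
    # Hypothetical weighting: cache reuse dominates, free memory breaks ties.
    return 10.0 * cache_hit + node.free_gpu_mem_gb / 100.0

def pick_node(nodes: list[Node], prompt_tokens: int, mem_needed_gb: float) -> Node:
    return max(nodes, key=lambda n: score_node(n, prompt_tokens, mem_needed_gb))
```

A node holding 90% of the prompt's prefix in cache would win here over a node with more free memory but a cold cache, which is the trade-off cache-aware routing exists to make.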
The disaggregated serving architecture separates the prefill stage (processing the input prompt) from the decode stage (generating output tokens) across different GPU pools. This separation yields significant efficiency gains because prefill is compute-bound and benefits from GPUs with high compute throughput, while decode is memory-bandwidth-bound and can run on different hardware configurations. The cache-aware routing system tracks which prompts have been processed on which nodes and directs subsequent requests to nodes that already hold the relevant KV cache entries warm in GPU memory, avoiding redundant computation for multi-turn conversations and repeated system prompts.
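Cache-aware routing of this kind is commonly built on hashing fixed-size token blocks, so that requests sharing a prefix produce a shared chain of block hashes. The sketch below works under that assumption; the block size, hashing scheme, class name, and tie-breaking rule are illustrative, not llm-d internals.

```python
import hashlib

BLOCK = 16  # tokens per hashed block (granularity is an assumption, not llm-d's value)

def block_hashes(tokens: list[int]) -> list[str]:
    """Rolling hashes of token blocks: shared prefixes yield identical hash chains."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

class CacheAwareRouter:
    def __init__(self, nodes: list[str]):
        self.nodes = nodes
        self.cache = {n: set() for n in nodes}  # node -> block hashes believed warm

    def route(self, tokens: list[int]) -> str:
        hashes = block_hashes(tokens)

        def warm_prefix(node: str) -> int:
            """Count leading blocks already cached on this node."""
            n = 0
            for h in hashes:
                if h not in self.cache[node]:
                    break
                n += 1
            return n

        # Longest warm prefix wins; ties go to the node with fewer cached
        # blocks, a crude stand-in for routing cold requests to idler nodes.
        best = max(self.nodes, key=lambda n: (warm_prefix(n), -len(self.cache[n])))
        self.cache[best].update(hashes)
        return best
```

With this structure, a follow-up turn of the same conversation routes back to the node whose GPU memory already holds the conversation's KV cache, while an unrelated prompt lands elsewhere.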
llm-d builds on vLLM as its serving engine while adding the cluster-level intelligence that transforms individual GPU servers into a coordinated inference platform. The project integrates with Kubernetes' native scaling mechanisms for automatic GPU allocation based on request volume, and supports mixed hardware configurations where different model sizes and quantization levels are served across heterogeneous GPU pools. With 2,900+ GitHub stars and an Apache-2.0 license, llm-d targets AI platform teams that need production-grade inference infrastructure beyond what a single vLLM instance provides.
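To make the heterogeneous-pool idea concrete, here is a back-of-the-envelope sizing helper showing why quantization changes which GPU pool a model fits on. The overhead factor is an assumption, and real placement must also budget for KV cache growth, activations, and the chosen parallelism strategy.

```python
import math

def gpus_needed(params_b: float, bits: int, gpu_mem_gb: float,
                overhead: float = 1.2) -> int:
    """Rough GPU count for model weights alone.

    params_b: parameter count in billions; bits: bits per weight after
    quantization; overhead: assumed multiplier for runtime memory beyond
    the raw weights (hypothetical fudge factor, not a measured value).
    """
    weight_gb = params_b * bits / 8  # billions of params -> gigabytes of weights
    return math.ceil(weight_gb * overhead / gpu_mem_gb)
```

For example, a 70B model in 16-bit needs several 80 GB GPUs, while the same model quantized to 4-bit fits on one, which is why mixed pools can serve different quantization levels on different hardware.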