
llm-d

Kubernetes-native distributed LLM inference stack


llm-d is an open-source, Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates the prefill and decode stages across different GPU pools to improve resource utilization, routes requests to nodes with warm KV caches, and uses vLLM as the serving engine. The project is Apache-2.0 licensed and has 2,900+ GitHub stars.

llm-d addresses the operational complexity of running large language model inference at scale on Kubernetes. While individual serving engines like vLLM handle the mechanics of running models on GPUs, production deployments require an orchestration layer that manages routing, scheduling, scaling, and resource allocation across a fleet of GPU nodes. llm-d provides this orchestration through a Kubernetes-native architecture that uses custom resources and operators to declare inference topologies, with intelligent routing that considers KV cache state, GPU memory availability, and request characteristics when assigning work to nodes.
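The routing decision described above can be pictured as a scoring function over per-node signals. The following is a minimal sketch, not llm-d's actual scheduler: the `NodeState` fields, the weights, and the `pick_node` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    # All fields are hypothetical signals a router might collect per node.
    name: str
    cache_hit_ratio: float     # fraction of the prompt already in the node's KV cache (0..1)
    free_gpu_mem_ratio: float  # fraction of GPU memory still free (0..1)
    queue_depth: int           # requests currently waiting on the node


def score(node: NodeState, w_cache: float = 0.6, w_mem: float = 0.3,
          w_queue: float = 0.1) -> float:
    """Higher is better. The weights are illustrative, not tuned values."""
    # Squash queue depth into [0, 1) so it is comparable to the ratio signals.
    queue_penalty = node.queue_depth / (1 + node.queue_depth)
    return (w_cache * node.cache_hit_ratio
            + w_mem * node.free_gpu_mem_ratio
            - w_queue * queue_penalty)


def pick_node(nodes: list[NodeState]) -> NodeState:
    """Return the node with the highest score."""
    return max(nodes, key=score)


if __name__ == "__main__":
    fleet = [
        NodeState("gpu-a", cache_hit_ratio=0.9, free_gpu_mem_ratio=0.2, queue_depth=4),
        NodeState("gpu-b", cache_hit_ratio=0.0, free_gpu_mem_ratio=0.9, queue_depth=0),
    ]
    print(pick_node(fleet).name)
```

With these weights a warm cache outweighs an idle but cold node; a real deployment would tune or replace the weighting based on measured latency.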

The disaggregated serving architecture separates the prefill stage (processing the input prompt) from the decode stage (generating output tokens) across different GPU pools. The separation pays off because the two stages stress hardware differently: prefill is compute-bound and benefits from GPUs with high compute throughput, while decode is memory-bandwidth-bound and can run on a different hardware configuration. The cache-aware routing system tracks which prompts have been processed on which nodes and directs subsequent requests to nodes that already hold the relevant KV cache entries in GPU memory, which avoids redundant computation for multi-turn conversations and repeated system prompts.
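One common way to implement this kind of cache tracking is to hash the prompt in fixed-size token blocks and remember which node has seen which blocks. The sketch below illustrates that idea under stated assumptions; it is not llm-d's routing code. `PrefixRouter`, `block_hashes`, and the tiny `BLOCK_SIZE` are hypothetical, and a real router would learn cache state from the serving engine rather than assume it.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per block; deliberately tiny for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so one hash identifies the whole prefix up to it."""
    hashes, prev = [], b""
    full_len = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full_len, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


class PrefixRouter:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.cached = defaultdict(set)          # node -> block hashes assumed warm
        self.load = {n: 0 for n in self.nodes}  # node -> requests routed so far

    def cached_prefix_len(self, node, hashes):
        """Count leading blocks of the request that the node already holds."""
        count = 0
        for h in hashes:
            if h not in self.cached[node]:
                break
            count += 1
        return count

    def route(self, tokens):
        hashes = block_hashes(tokens)
        # Prefer the longest warm prefix; break ties by the lowest load.
        best = max(
            self.nodes,
            key=lambda n: (self.cached_prefix_len(n, hashes), -self.load[n]),
        )
        self.cached[best].update(hashes)  # assume the node now caches this prompt
        self.load[best] += 1
        return best


if __name__ == "__main__":
    router = PrefixRouter(["gpu-0", "gpu-1"])
    system_prompt = list(range(8))  # stand-in token IDs for a shared system prompt
    print(router.route(system_prompt + [100, 101, 102, 103]))
    print(router.route(list(range(50, 58))))
    print(router.route(system_prompt + [200, 201, 202, 203]))
```

The third request shares the system prompt with the first, so it is sent back to the node that served the first one even though that node is no less loaded than the other. The sketch never evicts entries, which a real cache index would have to do.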

llm-d builds on vLLM as its serving engine and adds the cluster-level coordination that turns individual GPU servers into a single inference platform. The project integrates with Kubernetes' native scaling mechanisms to allocate GPUs automatically based on request volume, and supports mixed hardware configurations in which different model sizes and quantization levels are served across heterogeneous GPU pools. llm-d targets AI platform teams that need production-grade inference infrastructure beyond what a single vLLM instance provides.
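To make scaling on request volume concrete, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), with its default 10% tolerance. Whether a given llm-d deployment uses the HPA, and the choice of in-flight requests per replica as the metric, are assumptions made for illustration; the function name and bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 8, tolerance: float = 0.1) -> int:
    """HPA-style replica calculation.

    current_metric and target_metric are per-replica values, for example
    in-flight requests per serving pod (an assumed metric).
    """
    ratio = current_metric / target_metric
    # Skip scaling when the metric is within tolerance of the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas, each seeing 30 in-flight requests against a target of 20.
    print(desired_replicas(4, current_metric=30, target_metric=20))
```

For GPU-backed pods, adding a replica only helps if the cluster has a free GPU node or a node autoscaler that can provision one, so the upper bound usually reflects available hardware.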

Pricing

Free and open source (Apache-2.0)

Platforms

Kubernetes — Helm charts, requires GPU nodes with vLLM

Related Tools

Claude

Top Pick

Anthropic's frontier AI assistant

Anthropic's AI assistant known for strong reasoning, nuanced writing, and extended context up to 200K tokens. Available in Opus (most capable), Sonnet (balanced), and Haiku (fast) tiers. Features web search, deep research, file analysis, code execution, artifacts, and Projects for organized workflows. Claude Code provides terminal-based agentic coding. API supports tool use, batch processing, and prompt caching. Available via claude.ai, mobile apps, and developer API.

freemium

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

open-source

xAI Python SDK

Official Python SDK for the xAI API

The xAI Python SDK is the official Python client for the xAI API, giving developers a direct way to build Grok-powered apps without relying on community proxies or unofficial wrappers. It supports synchronous and asynchronous Python clients for chat completions, streaming responses, function/tool calling, and multimodal workflows, making it a clean fit for backend services, agents, notebooks, and developer tools that need programmatic xAI access.

open-source

Freestyle

Sandboxes for coding agents — Linux VMs, Git, and deploys in one box

Freestyle is YC-backed sandbox infrastructure built for AI coding agents, shipping secure Linux VMs with nested virtualization, Git servers, and one-click web deploys. It lets agents run real workloads, branch repos, and deploy apps under short-lived identities while billing only for active compute. Used in production by vly.ai, Rork, and Vibeflow.

freemium

OpenSRE

Open-source toolkit for building AI SRE incident response agents

OpenSRE is Tracer Cloud’s open-source public-alpha Python toolkit for building AI SRE agents that investigate and respond to production incidents. It ships 60+ tools across observability, databases, incident management, communications, deployment and protocol integrations, plus simulation/evaluation workflows for benchmarking agent accuracy before live pager use.

open-source

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium