exo is an open-source distributed inference engine that pools compute resources across multiple consumer devices to run AI models that exceed the memory capacity of any single machine. Where traditional approaches require expensive server-grade GPUs or cloud instances, exo lets developers combine the hardware they already own into a single local inference cluster. The system automatically handles device discovery, topology-aware work splitting, and inter-node communication.

The technical foundation is topology-aware auto parallelism that splits work across available devices based on memory, compute, latency, and bandwidth. Communication between nodes can use RDMA over Thunderbolt 5 for co-located clusters or standard networking for looser setups. The current README emphasizes MLX and MLX distributed communication, plus compatibility with OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama APIs for client access.

With about 45K GitHub stars, exo has become one of the most visible open-source projects for multi-device LLM inference. Public README benchmark examples include DeepSeek v3.1 671B and Kimi K2 Thinking on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA. The project is Apache 2.0 licensed and developed by Exo Labs. It provides familiar API compatibility, a dashboard for managing the cluster, and automatic device discovery on local networks.

exo vs Ollama — Multi-Device Distributed Inference vs Single-Machine Local LLM

exo and Ollama both enable running LLMs locally without cloud dependencies, but they solve fundamentally different scaling problems. Ollama is the simplest path to single-machine inference with 95,000+ GitHub stars and the broadest model ecosystem. exo pools compute across multiple consumer devices to run models that exceed any single machine's capacity, enabling 100B+ parameter inference on hardware you already own.