exo is an open-source distributed inference engine that pools compute resources across multiple consumer devices to run AI models that exceed the memory capacity of any single machine. Where traditional approaches require expensive server-grade GPUs or cloud instances, exo lets developers combine the hardware they already own — MacBooks, gaming PCs, workstations, even phones — into a single inference cluster. The system automatically handles model partitioning, device discovery, and inter-node communication.
The technical foundation is a dynamic model sharding algorithm that splits a model's transformer layers across available devices in proportion to their memory and compute capabilities. Communication between nodes uses RDMA over Thunderbolt for local clusters or standard TCP for distributed setups. exo supports mixing heterogeneous hardware: an Apple Silicon MacBook can collaborate with an NVIDIA RTX workstation in the same cluster. Supported inference backends include MLX for Apple Silicon and tinygrad for NVIDIA GPUs, with CPU fallback where no accelerator is available.
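To make the sharding idea concrete, here is a minimal sketch of memory-weighted layer partitioning across heterogeneous devices. The function name, device names, and memory figures are illustrative assumptions, not exo's actual implementation, which also accounts for compute capability:

```python
def partition_layers(num_layers, device_memory_gb):
    """Assign contiguous transformer-layer ranges to devices in
    proportion to each device's available memory (simplified sketch;
    exo's real partitioner also weighs compute capability)."""
    total = sum(device_memory_gb.values())
    devices = list(device_memory_gb.items())
    shards, start = {}, 0
    for i, (name, mem) in enumerate(devices):
        if i == len(devices) - 1:
            end = num_layers  # last device absorbs rounding remainder
        else:
            end = start + round(num_layers * mem / total)
        shards[name] = (start, end)  # half-open layer range [start, end)
        start = end
    return shards

# e.g. an 80-layer model split over a 32 GB MacBook and a 24 GB RTX box
print(partition_layers(80, {"macbook-m3": 32, "rtx-4090": 24}))
# → {'macbook-m3': (0, 46), 'rtx-4090': (46, 80)}
```

Each device then holds only its own layer range in memory and streams activations to the next shard in the ring, which is what lets the cluster's aggregate memory exceed any single machine's.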
With over 43,000 GitHub stars, exo has become one of the most widely adopted open-source solutions for multi-device LLM inference. Practical demonstrations include running 671-billion-parameter models across clusters of Ryzen AI Max laptops and trillion-parameter inference across four AMD workstations. The project is Apache 2.0 licensed and developed by Exo Labs. It provides an OpenAI-compatible API endpoint, a ChatGPT-style web interface, and automatic device discovery on local networks.
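Because the API is OpenAI-compatible, existing client code can target a local exo cluster by changing only the base URL. The host, port, and model name below are illustrative assumptions; check your running node for the actual endpoint it advertises:

```python
import json
import urllib.request

# Assumed endpoint of a local exo node; adjust to what your node reports.
API_URL = "http://localhost:52415/v1/chat/completions"

def build_request(model, prompt):
    """Build a standard OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(model, prompt):
    """POST a chat completion to the cluster and return the reply text."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Any OpenAI SDK can be pointed at the same URL via its `base_url` setting, so tools built for hosted APIs work against the local cluster unchanged.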