aicoolies logo

exo

Run frontier AI models across a cluster of everyday devices

Share
open-sourceOpen Source
Visit Website →

exo turns multiple local machines into a unified AI compute cluster for models that exceed a single device's memory. It automatically discovers devices, uses topology-aware auto parallelism to split work across available resources, and supports RDMA over Thunderbolt 5 for co-located clusters or standard networking for looser setups. The project exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama-compatible APIs plus a dashboard for cluster management.

We have a review for this tool

A detailed review by the aicoolies team — click to read

exo is an open-source distributed inference engine that pools compute resources across multiple consumer devices to run AI models that exceed the memory capacity of any single machine. Where traditional approaches require expensive server-grade GPUs or cloud instances, exo lets developers combine the hardware they already own into a single local inference cluster. The system automatically handles device discovery, topology-aware work splitting, and inter-node communication.

The technical foundation is topology-aware auto parallelism that splits work across available devices based on memory, compute, latency, and bandwidth. Communication between nodes can use RDMA over Thunderbolt 5 for co-located clusters or standard networking for looser setups. The current README emphasizes MLX and MLX distributed communication, plus compatibility with OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama APIs for client access.

With about 45K GitHub stars, exo has become one of the most visible open-source projects for multi-device LLM inference. Public README benchmark examples include DeepSeek v3.1 671B and Kimi K2 Thinking on 4 × M3 Ultra Mac Studio with Tensor Parallel RDMA. The project is Apache 2.0 licensed and developed by Exo Labs. It provides familiar API compatibility, a dashboard for managing the cluster, and automatic device discovery on local networks.

Pricing

Free and open-source under Apache 2.0

Platforms

macOS/Linux source paths; MLX distributed; Thunderbolt 5 RDMA or TCP; OpenAI/Claude/Ollama-compatible APIs

Categories

Tags

Use Cases

Alternatives

Ollama logo

Ollama

Run LLMs locally with one command

Tool for running large language models locally on your machine with a simple CLI interface. Download and run Llama 3, Mistral, Gemma, Phi, Code Llama, and dozens of other open-source models with a single command. Features model management, GPU acceleration (NVIDIA/AMD/Apple Silicon), OpenAI-compatible API server, Modelfile for customization, and multi-model switching. Ideal for offline AI development, privacy-sensitive use cases, and local testing. 120K+ GitHub stars.

open-sourceOpen Source
Lemonade logo

Lemonade

AMD's open-source local LLM server with GPU and NPU acceleration

Lemonade is AMD's open-source local AI serving platform for LLMs, image generation, speech recognition, and text-to-speech on your own hardware. Built in lightweight C++, it can detect CPU, GPU, and NPU backends and is extra optimized for Ryzen AI, Radeon, and Strix Halo PCs. Lemonade exposes OpenAI, Anthropic, and Ollama-compatible APIs, ships with a desktop model manager, and supports source-confirmed GGUF, FLM, and ONNX models across Windows, Linux, macOS, and Docker.

open-sourceOpen Source
vLLM logo

vLLM

High-throughput LLM serving engine

vLLM is an Apache-2.0 LLM inference and serving engine focused on high-throughput self-hosted model APIs. It combines PagedAttention, continuous batching, prefix caching, quantization options, OpenAI-compatible serving, structured outputs, metrics, Docker/Kubernetes deployment guidance and integrations with agent and LLM frameworks.

open-sourceOpen Source
llama.cpp logo

llama.cpp

High-performance local LLM inference in C/C++

llama.cpp is the foundational C/C++ library with 75K+ GitHub stars powering local LLM inference on consumer hardware. Provides optimized CPU and GPU inference for quantized models in GGUF format. Supports LLaMA, Mistral, Phi, Gemma, and most open-weight families. Features 2-8 bit quantization for reduced memory, multi-GPU support, context extension, grammar-constrained output, and an OpenAI-compatible API server. The engine behind Ollama and LM Studio.

open-sourceOpen Source

Related Tools

Claude

Claude

Top Pick

Anthropic's frontier AI assistant

Anthropic's AI assistant known for strong reasoning, nuanced writing, and extended context up to 200K tokens. Available in Opus (most capable), Sonnet (balanced), and Haiku (fast) tiers. Features web search, deep research, file analysis, code execution, artifacts, and Projects for organized workflows. Claude Code provides terminal-based agentic coding. API supports tool use, batch processing, and prompt caching. Available via claude.ai, mobile apps, and developer API.

freemium

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

open-sourceOpen Source

CLIProxyAPI

Self-hosted proxy API for routing AI CLI accounts into OpenAI-compatible endpoints

CLIProxyAPI is an open-source Go proxy server that wraps Gemini CLI, Claude Code, OpenAI Codex, Grok Build, and related CLI account flows behind OpenAI/Gemini/Claude-compatible API endpoints. Use it carefully: it can touch OAuth sessions, auth files, logs, and provider account policies, so production use needs credential and ToS review.

open-sourceOpen SourceTelemetry
xAI Python SDK logo

xAI Python SDK

Official Python SDK for the xAI API

The xAI Python SDK is the official Python client for the xAI API, giving developers a direct way to build Grok-powered apps without relying on community proxies or unofficial wrappers. It supports synchronous and asynchronous Python clients for chat completions, streaming responses, function/tool calling, and multimodal workflows, making it a clean fit for backend services, agents, notebooks, and developer tools that need programmatic xAI access.

open-sourceOpen Source
OpenHuman logo

OpenHuman

Local-first personal AI agent with memory trees, desktop integrations, and private workspace context.

OpenHuman is an open-source, local-first personal AI agent from TinyHumans. It combines a desktop app, persistent memory trees, Obsidian-compatible storage, OAuth integrations, and local model support into a private assistant harness. It is most interesting for users who want agentic workflows and long-term memory without handing every context detail to a fully cloud-hosted assistant.

open-sourceOpen SourceTelemetry
DenchClaw logo

DenchClaw

Local AI CRM and workflow automation on OpenClaw

DenchClaw is a local AI CRM and workflow automation app built on OpenClaw. It runs on a Mac at localhost, lets users chat with local business data, and focuses on lead enrichment, founder/customer research, and outreach automation. It belongs beside local AI, workflow automation, and OpenClaw-style personal-agent tools rather than pure coding IDEs.

open-sourceOpen Source

Comparisons