aicoolies logo

llama-swap

Hot-swap between local LLM models via OpenAI-compatible API

Share
open-sourceOpen Source
Visit Website →

llama-swap is an open-source tool that manages multiple local LLM models behind a single OpenAI-compatible API endpoint. It automatically loads and unloads models on demand, letting developers hot-swap between different models without restarting services. With 3.1K+ GitHub stars, it solves the common pain point of running multiple specialized models on limited hardware.

llama-swap is a transparent HTTP proxy server written in Go that enables automatic hot-swapping between multiple LLM inference servers on a single machine. It works with any OpenAI and Anthropic API compatible backend — llama.cpp, vLLM, Ollama, and others — presenting a unified API endpoint while dynamically loading and unloading models based on incoming requests. When a request arrives with a specific model name, llama-swap checks which upstream server should handle it and automatically starts, stops, or swaps backends as needed.

The project ships as a single binary with zero external dependencies, making deployment remarkably simple across Linux, macOS, Windows, and FreeBSD. Configuration is handled through a single YAML file that maps model names to their corresponding server launch commands and parameters. For basic setups, llama-swap manages one model at a time to conserve GPU memory. Advanced users can leverage the groups feature to run multiple models simultaneously when hardware allows, with configurable memory limits and automatic eviction policies for resource management.

llama-swap is particularly valuable for developers and teams running local AI infrastructure who need access to multiple specialized models but lack the GPU memory to keep them all loaded. Rather than manually stopping and starting servers, llama-swap handles the lifecycle automatically with minimal latency overhead. The project is open-source under the MIT license and provides pre-built binaries for all major platforms. It integrates seamlessly with existing OpenAI-compatible client applications, requiring no code changes on the client side — just point your application at the llama-swap endpoint and reference model names as defined in the configuration.

Pricing

Free and open-source

Platforms

Self-hosted (cross-platform)

Categories

Tags

Use Cases

Alternatives

Ollama logo

Ollama

Run LLMs locally with one command

Tool for running large language models locally on your machine with a simple CLI interface. Download and run Llama 3, Mistral, Gemma, Phi, Code Llama, and dozens of other open-source models with a single command. Features model management, GPU acceleration (NVIDIA/AMD/Apple Silicon), OpenAI-compatible API server, Modelfile for customization, and multi-model switching. Ideal for offline AI development, privacy-sensitive use cases, and local testing. 120K+ GitHub stars.

open-sourceOpen Source
LocalAI logo

LocalAI

Free, open-source local AI inference engine

LocalAI is an open-source local AI inference engine with 44K+ GitHub stars that runs LLMs, image generation, audio transcription, and embeddings entirely on consumer hardware without GPU requirements. Provides an OpenAI API-compatible REST endpoint as a drop-in replacement, supporting 1000+ models including LLaMA, Mistral, and Phi families. Features include text-to-speech, speech-to-text, function calling, constrained grammar output, and multi-modal capabilities all running locally.

open-sourceOpen Source
Open WebUI logo

Open WebUI

Self-hosted AI platform with ChatGPT-like interface for local and cloud LLMs.

Extensible, self-hosted AI platform with 290M+ Docker pulls and 124K+ GitHub stars. Supports Ollama, OpenAI-compatible APIs, and any Chat Completions backend. Features built-in RAG, multi-user RBAC, voice/video calls, Python function workspace, model builder, and web browsing. Runs entirely offline with enterprise features including SSO and audit logging.

free

Related Tools

Ghostty logo

Ghostty

Top Pick

Fast, native terminal emulator

GPU-accelerated terminal emulator written in Zig by Mitchell Hashimoto (HashiCorp co-founder). Native UI rendering on macOS and Linux. Supports ligatures, true color, Kitty graphics protocol, and splits/tabs. Configurable via a simple key-value file with sensible defaults. Open-source with 20K+ GitHub stars and a focus on correctness, speed, and minimal resource usage. Growing as a modern alternative to iTerm2, Alacritty, and WezTerm.

open-sourceOpen Source
Claude Code logo

Claude Code

Top Pick

Anthropic's agentic coding CLI

Anthropic's agentic CLI coding tool that delegates complex tasks to Claude directly from the terminal. Understands entire codebases via automatic context gathering, edits multiple files, runs shell commands, and manages Git workflows autonomously. Supports CLAUDE.md for persistent project instructions, integrates with VS Code and JetBrains, and uses Claude Opus/Sonnet with extended thinking for complex architectural decisions. Built for terminal-first developers.

paidOpen Source
Hermes Agent logo

Hermes Agent

Top Pick

Open-source AI agent framework with persistent memory, reusable skills, tools, and messaging gateways

Hermes Agent is an open-source AI agent framework with persistent memory, reusable skills, 40+ tools, cron jobs, and messaging gateways.

open-sourceOpen Source
pi dev code

Pi

Top Pick

Minimal terminal coding harness

Pi Coding Agent is an MIT-licensed Node.js CLI from earendil-works for building and running coding agents in a local terminal. The current package describes a read/bash/edit/write toolset and session management, while the repo positions Pi as a unified LLM API, agent loop, TUI, and coding-agent CLI. It is best framed as a lean, self-extensible BYO-model toolkit rather than a managed IDE.

open-sourceOpen Source
OpenCode logo

OpenCode

Top Pick

Open-source AI coding agent for the terminal

Open-source terminal-based AI coding agent built in Go by the SST team, with a rich TUI (Bubble Tea) supporting 75+ model providers including OpenAI, Anthropic, Gemini, Bedrock, Groq, and OpenRouter. Features vim-like editing, persistent SQLite sessions, and LSP integration for 40+ languages. Fully free with no vendor lock-in, it has rapidly grown to 95k+ GitHub stars.

open-source
Codex logo

Codex

Top Pick

OpenAI coding agent for app, editor, terminal, and cloud work

Codex is OpenAI's coding agent for software development across the Codex app, editor, terminal, and cloud tasks. It helps write, review, debug, refactor, and automate code, with ChatGPT plan access for managed surfaces and API-key usage for CLI, SDK, and IDE workflows. The open-source CLI and SDK support local repository work, while cloud features add GitHub review, Slack/Linear integrations, worktrees, skills, MCP, and automations.

freemiumOpen Source