llama-swap is a transparent HTTP proxy server written in Go that enables automatic hot-swapping between multiple LLM inference servers on a single machine. It works with any OpenAI- or Anthropic-API-compatible backend — llama.cpp, vLLM, Ollama, and others — presenting a unified API endpoint while dynamically loading and unloading models based on incoming requests. When a request arrives with a specific model name, llama-swap checks which upstream server should handle it and automatically starts, stops, or swaps backends as needed.
The project ships as a single binary with zero external dependencies, making deployment remarkably simple across Linux, macOS, Windows, and FreeBSD. Configuration is handled through a single YAML file that maps model names to their corresponding server launch commands and parameters. For basic setups, llama-swap manages one model at a time to conserve GPU memory. Advanced users can leverage the groups feature to run multiple models simultaneously when hardware allows, with configurable memory limits and automatic eviction policies for resource management.
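A minimal configuration might look like the sketch below. The field names, ports, and model paths here are illustrative; exact keys can differ between releases, so consult the llama-swap documentation for your version.

```yaml
# Illustrative config.yaml sketch — names, ports, and paths are assumptions.
models:
  "llama-8b":
    cmd: llama-server --port 9001 -m /models/llama-3.1-8b.gguf
    proxy: http://127.0.0.1:9001
  "qwen-coder":
    cmd: llama-server --port 9002 -m /models/qwen2.5-coder.gguf
    proxy: http://127.0.0.1:9002

# The groups feature can keep several models resident at once when
# hardware allows, instead of the default one-at-a-time swapping.
groups:
  coding:
    members: ["qwen-coder", "llama-8b"]
```

With a mapping like this, a request for "qwen-coder" launches the second command and proxies traffic to its port, stopping other models first unless a group says otherwise.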
llama-swap is particularly valuable for developers and teams running local AI infrastructure who need access to multiple specialized models but lack the GPU memory to keep them all loaded. Rather than manually stopping and starting servers, llama-swap handles the lifecycle automatically with minimal latency overhead. The project is open-source under the MIT license and provides pre-built binaries for all major platforms. It integrates seamlessly with existing OpenAI-compatible client applications, requiring no code changes on the client side — just point your application at the llama-swap endpoint and reference model names as defined in the configuration.
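Because the proxy speaks the OpenAI wire format, a client needs nothing beyond pointing its base URL at llama-swap. The sketch below uses only the Python standard library; the endpoint address and model name are assumptions matching the illustrative configuration above, not fixed values.

```python
import json
import urllib.request

# Hypothetical address — substitute wherever your llama-swap instance listens.
BASE_URL = "http://localhost:8080"

def chat(model: str, prompt: str) -> dict:
    """Send an OpenAI-style chat completion through llama-swap.

    The proxy reads the "model" field of the request body and starts,
    stops, or swaps the matching upstream server before forwarding.
    """
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires a running llama-swap instance, e.g.:
# reply = chat("llama-8b", "Hello!")
# print(reply["choices"][0]["message"]["content"])
```

Switching models is then just a matter of passing a different name to `chat`; no client-side reconfiguration is needed when backends are added to or removed from the YAML file.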