What This Stack Does
This stack gets the most out of AMD Ryzen AI hardware by combining purpose-built tools that exploit NPU, GPU, and CPU acceleration. Lemonade serves as the primary inference engine, using hybrid NPU-GPU execution for optimal throughput and power efficiency on Ryzen AI 300-series and newer processors. Open WebUI provides a rich chat interface with conversation history, model switching, and RAG capabilities on top of Lemonade's API.
Multi-Modal Routing and Model Selection
The architecture routes all multi-modal requests through Lemonade's OpenAI-compatible API. Chat completion, image generation via Stable Diffusion, speech-to-text via Whisper, and text-to-speech via Kokoro all flow through the same endpoint. Open WebUI connects to this single URL for a unified experience. Ollama runs alongside as a fallback runtime for GGUF models that may not yet have AMD-optimized variants in Lemonade's supported formats.
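The single-endpoint design can be sketched as a small routing table. The endpoint paths follow the standard OpenAI API conventions; the base URL, port, and path prefix below are assumptions, so adjust them to match your Lemonade server's actual configuration:

```python
# Hypothetical base URL for a local Lemonade server -- check your install.
BASE_URL = "http://localhost:8000/api/v1"

# Standard OpenAI-compatible paths, one per modality.
ENDPOINTS = {
    "chat": "/chat/completions",          # LLM chat
    "image": "/images/generations",       # Stable Diffusion
    "transcribe": "/audio/transcriptions",  # Whisper speech-to-text
    "speech": "/audio/speech",            # Kokoro text-to-speech
}

def endpoint_for(modality: str) -> str:
    """Return the full URL a client should call for a given modality."""
    return BASE_URL + ENDPOINTS[modality]
```

Because every modality hangs off the same base URL, Open WebUI (or any OpenAI-compatible client) only needs that one address configured.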
Model selection follows a practical hierarchy. Lemonade handles primary LLM inference using ONNX models optimized for Ryzen AI NPU execution, delivering the best performance-per-watt on AMD hardware. For models only available in GGUF format, Lemonade's llama.cpp backend provides GPU-accelerated inference via ROCm or Vulkan. Ollama serves as the compatibility fallback for edge cases where a specific model or quantization variant is only available in Ollama's registry.
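The hierarchy above reduces to a simple preference order. This is an illustrative sketch, not Lemonade's actual dispatch logic; the format strings and backend names are hypothetical labels for the three tiers described in the text:

```python
def pick_backend(available_formats: set) -> str:
    """Choose a runtime per the hierarchy described above:
    NPU-optimized ONNX first, GGUF on Lemonade's llama.cpp backend
    (ROCm/Vulkan) second, Ollama as the compatibility fallback."""
    if "onnx" in available_formats:
        return "lemonade-npu"        # best performance-per-watt on Ryzen AI
    if "gguf" in available_formats:
        return "lemonade-llamacpp"   # GPU-accelerated via ROCm or Vulkan
    return "ollama"                  # registry-only models and edge cases
```

A model published in both formats lands on the NPU path; a GGUF-only quantization stays on Lemonade's llama.cpp backend rather than falling through to Ollama.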
From Prototyping to Privacy-First Production
The development workflow starts with Lemonade's desktop application for model discovery, downloading, and testing. Developers browse models, evaluate them in the built-in chat interface, and then connect their applications to the API. Open WebUI extends this into a production-ready chat platform with user management, document upload for RAG, web search integration, and conversation sharing.
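Connecting an application to the API amounts to sending standard OpenAI-style chat requests to the local server. A minimal sketch using only the standard library, assuming a hypothetical base URL and model name (substitute your Lemonade server's address and a model it has loaded):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str):
    """Assemble an OpenAI-style chat completion request for a local server."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    return req, body

if __name__ == "__main__":
    # Requires a running server at this (assumed) address.
    req, _ = build_chat_request("http://localhost:8000/api/v1", "my-model", "Hello")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request shape is plain OpenAI format, the same code works unchanged whether the model is served by Lemonade or by Ollama's compatible endpoint.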
Privacy and cost advantages are the stack's core selling points. All inference runs on local hardware with no data leaving the machine. There are no per-token API costs, no usage limits, and no internet dependency. The NPU-accelerated inference on Ryzen AI hardware keeps power consumption low enough for sustained use on laptop battery, making mobile local AI practical rather than theoretical.
The Bottom Line
The software cost is zero — all three tools are open-source under permissive licenses. The hardware investment depends on the AMD system, ranging from existing Ryzen AI laptops to dedicated workstations. For teams already on AMD hardware, this stack delivers AI capabilities that previously required cloud API subscriptions.