Cactus is a mobile-first AI inference engine developed by Cactus Compute, a Y Combinator S25 company, that brings large language model execution to smartphones and wearable devices. While the local AI movement has focused primarily on desktop and server hardware through tools like Ollama and llama.cpp, Cactus targets the constrained environment of mobile processors where memory, power, and thermal budgets are dramatically tighter. The engine achieves the fastest publicly benchmarked inference speeds on ARM CPUs among open-source solutions.
The SDK covers the major mobile development platforms, with native integrations for iOS via Swift, Android via Kotlin and Java, cross-platform development via Flutter and React Native, and the web via JavaScript. Each integration handles model loading, memory management, and hardware acceleration natively rather than through generic wrappers. On Apple devices, the engine can offload inference to the Neural Engine (the NPU in A-series and M-series chips). The engine supports the GGUF model format from Hugging Face with quantization down to 2-bit, allowing models that would otherwise require many gigabytes of RAM to fit within mobile memory constraints.
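To see why low-bit quantization matters on phones, a back-of-envelope calculation helps: weight memory scales linearly with bits per parameter. The sketch below is illustrative arithmetic only (the function name is ours, and it ignores KV cache, activations, and runtime overhead), not part of the Cactus API.

```kotlin
// Approximate weight memory for a model, given parameter count and
// quantization bit-width. Ignores KV cache and runtime overhead.
fun weightMemoryGiB(params: Double, bitsPerWeight: Double): Double =
    params * bitsPerWeight / 8.0 / (1 shl 30)

fun main() {
    // A 3B-parameter model at fp16 needs roughly 5.6 GiB of weights,
    // far beyond typical mobile memory budgets...
    println(weightMemoryGiB(3e9, 16.0))
    // ...while the same model at 2-bit fits in roughly 0.7 GiB.
    println(weightMemoryGiB(3e9, 2.0))
}
```

The roughly 8x reduction from fp16 to 2-bit is what brings multi-billion-parameter models within reach of mobile RAM, at some cost in output quality.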
Beyond basic text generation, Cactus supports on-device RAG pipelines by combining LLM inference with local embedding generation, so mobile applications can search and reason over private documents without any network connectivity. Vision model support through LLaVA and speech processing through Whisper extend the engine to multi-modal mobile AI applications. With 4,600+ GitHub stars and YC backing, Cactus fills a genuine gap in the AI tooling landscape: the infrastructure layer between server-grade local AI and the mobile devices that most people actually use.
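The shape of such an on-device RAG pipeline can be sketched in a few lines: embed each document locally, embed the query, rank by cosine similarity, and pass the top chunks to the LLM as context. The sketch below is a self-contained illustration, not Cactus code; `embed` is a toy hash-based stand-in for a real local embedding model, and all names are ours.

```kotlin
// Cosine similarity between two embedding vectors.
fun cosine(a: DoubleArray, b: DoubleArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

// Toy stand-in for an on-device embedding model: a bag-of-words vector
// bucketed by word hash. A real pipeline would call the engine's
// embedding model here instead.
fun embed(text: String, dim: Int = 16): DoubleArray {
    val v = DoubleArray(dim)
    for (w in text.lowercase().split(Regex("\\W+")).filter { it.isNotEmpty() })
        v[Math.floorMod(w.hashCode(), dim)] += 1.0
    return v
}

// Retrieval step: rank documents by similarity to the query and keep
// the top k, which would then be inserted into the LLM prompt as context.
fun retrieve(query: String, docs: List<String>, k: Int = 2): List<String> {
    val q = embed(query)
    return docs.sortedByDescending { cosine(embed(it), q) }.take(k)
}
```

Because both the embedding and ranking steps run locally, the documents never leave the device, which is the property that makes private-document search viable on mobile.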