Cactus achieves production-grade LLM inference on mobile through cross-platform, energy-efficient kernels optimized for the processors that power modern smartphones. ARM SIMD kernels tuned for Snapdragon, Apple, and MediaTek chipsets deliver 16-20 tokens per second on older devices like the Pixel 6a and iPhone 11, while the latest flagships exceed 70 tokens per second. The zero-copy Cactus Graph computation engine minimizes memory overhead, and intelligent power management balances responsiveness with battery conservation.
The SDK supports a wide range of open model architectures including Qwen, Gemma, Llama, DeepSeek, Phi, and Mistral, with an OpenAI-compatible API for familiar integration patterns. Beyond text generation, Cactus handles voice transcription, text-to-speech, embedding generation, and tool calling. An automatic cloud fallback mechanism handles requests that exceed device capabilities, providing graceful degradation rather than hard failures for computationally intensive operations.
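The local-first flow with cloud fallback described above can be sketched as a call that attempts on-device inference and degrades to a cloud endpoint only when the request exceeds device limits. This is a minimal illustration of the pattern, not the Cactus SDK's actual API: every class, function, and threshold below is a hypothetical stand-in, and the messages follow the familiar OpenAI-style role/content shape the SDK's compatible API implies.

```python
# Hypothetical sketch of local-first inference with automatic cloud fallback.
# All names (run_on_device, run_in_cloud, the 512-token cap) are illustrative
# assumptions, NOT the real Cactus SDK API.

from dataclasses import dataclass

# OpenAI-style chat message: a dict with "role" and "content" keys.
Message = dict


class DeviceCapabilityError(Exception):
    """Raised when a request exceeds what the on-device model can handle."""


@dataclass
class CompletionResult:
    text: str
    source: str  # "device" or "cloud"


def run_on_device(messages: list[Message], max_tokens: int) -> CompletionResult:
    # Stand-in for local inference; assume the on-device model caps output length.
    if max_tokens > 512:
        raise DeviceCapabilityError("request too large for on-device model")
    return CompletionResult(text="(local completion)", source="device")


def run_in_cloud(messages: list[Message], max_tokens: int) -> CompletionResult:
    # Stand-in for a request to an OpenAI-compatible cloud endpoint.
    return CompletionResult(text="(cloud completion)", source="cloud")


def complete(messages: list[Message], max_tokens: int = 256) -> CompletionResult:
    """Try on-device inference first; fall back to the cloud on capability errors."""
    try:
        return run_on_device(messages, max_tokens)
    except DeviceCapabilityError:
        return run_in_cloud(messages, max_tokens)


msgs = [{"role": "user", "content": "Summarize this note."}]
print(complete(msgs).source)                   # small request: served on device
print(complete(msgs, max_tokens=2048).source)  # too large: falls back to cloud
```

The key design point is that the fallback is triggered by a capability error rather than by the caller, so application code issues one call and gets graceful degradation instead of a hard failure.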
Cross-platform SDKs cover Flutter, React Native, Kotlin Multiplatform, and web, with framework-specific features like RAG fine-tuning in Flutter and image embedding in React Native. Backed by Y Combinator, the project offers free access to students, educators, nonprofits, and small businesses, lowering the barrier to building private on-device AI applications. For developers who need their AI features to work offline, protect user privacy, and deliver responsive interactions, Cactus provides the mobile-first inference infrastructure that makes those goals practical.