PrismML emerged from stealth in March 2026 with Bonsai, a family of 1-bit language models that achieve dramatic efficiency gains without proportional quality loss. The 8B parameter model fits in roughly 1GB of memory, compared to 16GB for a standard FP16 Llama 3 8B: at 1 bit per weight plus the FP16 scale-factor overhead described below, that works out to about a 14x reduction in model size. Inference runs at 44 tokens per second on an iPhone and faster still on desktop hardware. This efficiency makes capable language models practical for mobile devices, IoT endpoints, and any environment where compute and memory are constrained.
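The memory figures follow from simple arithmetic. The sketch below is a back-of-the-envelope check, assuming the per-128-weight FP16 scale factors described in the next paragraph; the exact on-disk layout of Bonsai may differ.

```python
# Rough memory math for an 8B-parameter model.
# Assumption: 1-bit weights plus one FP16 scale per 128-weight block,
# per the quantization scheme described in the text.

params = 8e9                      # 8B parameters
fp16_bytes = params * 2           # FP16 baseline: 2 bytes per weight -> ~16 GB

bits_per_weight = 1 + 16 / 128    # 1 sign bit + amortized FP16 scale = 1.125 bits
one_bit_bytes = params * bits_per_weight / 8

print(f"FP16 model:  {fp16_bytes / 1e9:.1f} GB")   # 16.0 GB
print(f"1-bit model: {one_bit_bytes / 1e9:.2f} GB")  # ~1.13 GB
print(f"Reduction:   {fp16_bytes / one_bit_bytes:.1f}x")  # ~14.2x
```

Counting the scale-factor overhead is what brings the headline figure to roughly 14x rather than a flat 16x.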
The technical approach uses a novel 1-bit quantization architecture with FP16 scale factors applied every 128 bits. PrismML provides custom forks of llama.cpp and MLX optimized for 1-bit inference, along with demo code, Colab notebooks, and developer integration documentation. The models are available on HuggingFace under the Apache 2.0 license. AnythingLLM integrated Bonsai models on launch day, demonstrating immediate ecosystem adoption and compatibility with existing local LLM infrastructure.
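PrismML has not published kernel-level details beyond this description, but the basic idea can be illustrated with a minimal numpy sketch: sign-quantize each weight to 1 bit and store one FP16 scale per 128-weight block. The scale choice (mean absolute value per block) and the function names here are assumptions for illustration, not PrismML's implementation.

```python
import numpy as np

BLOCK = 128  # one FP16 scale factor per 128 weights, as described above

def quantize_1bit(weights: np.ndarray):
    """Sign-quantize weights to 1 bit per value with a per-block FP16 scale."""
    w = weights.reshape(-1, BLOCK)
    scales = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)  # assumed scale rule
    signs = (w >= 0).astype(np.uint8)        # keep only the sign: 1 bit per weight
    packed = np.packbits(signs, axis=1)      # 128 bits -> 16 bytes per block
    return packed, scales

def dequantize_1bit(packed: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: +scale for sign bit 1, -scale for 0."""
    signs = np.unpackbits(packed, axis=1, count=BLOCK).astype(np.float32)
    return (2.0 * signs - 1.0) * scales.astype(np.float32)

# Example: 256 weights -> 2 blocks, 32 bytes of signs + 2 FP16 scales
w = np.random.randn(256).astype(np.float32)
packed, scales = quantize_1bit(w)
w_hat = dequantize_1bit(packed, scales)
print(packed.nbytes + scales.nbytes, "bytes quantized vs", w.nbytes, "bytes FP32")
```

In practice the custom llama.cpp and MLX forks operate directly on the packed representation rather than dequantizing whole tensors, which is where the inference speedups come from.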
With $16.25M in funding from Khosla Ventures, Cerberus, and Google, PrismML has the backing to develop the 1-bit quantization toolchain into a comprehensive platform. The Bonsai 8B, 4B, and 1.7B models provide different capability-efficiency tradeoffs for various deployment scenarios. The Show HN launch reached 355 points on Hacker News, and the positive reception on r/LocalLLaMA confirms strong community interest in edge-efficient LLMs. For developers building on-device AI experiences, Bonsai represents the most practical path to running capable models without cloud dependencies.