
PrismML Bonsai

First commercially viable 1-bit LLMs that are 14x smaller and 8x faster

Open Source

PrismML Bonsai delivers the first commercially viable 1-bit large language models, in 8B, 4B, and 1.7B parameter variants. The 8B model runs in about 1GB of RAM, versus 16GB for a standard FP16 model, and generates 44 tokens per second on an iPhone. Backed by $16.25M from investors including Khosla Ventures and released under Apache 2.0, Bonsai makes capable LLMs practical for edge devices and resource-constrained environments.

PrismML emerged from stealth in March 2026 with Bonsai, a family of 1-bit language models that achieve large efficiency gains without a proportional loss in quality. The 8B parameter model needs roughly 1GB of memory, compared with 16GB for a standard FP16 Llama 3 8B, which works out to about a 14x reduction in model size once per-group scale factors are counted. Inference runs at 44 tokens per second on an iPhone and faster on desktop hardware. That efficiency makes capable language models practical for mobile devices, IoT endpoints, and any environment where compute and memory are constrained.
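
The size figures can be checked with a little arithmetic. The sketch below assumes a round 8 billion parameters and one FP16 scale factor shared by each group of 128 one-bit weights (our reading of the scale layout in the technical description); the function names are illustrative and not part of any PrismML tooling.

```python
def effective_bits_per_weight(weight_bits: int = 1,
                              scale_bits: int = 16,
                              group_size: int = 128) -> float:
    """Bits stored per weight once the shared group scale is amortized."""
    return weight_bits + scale_bits / group_size


def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (ignores embeddings kept at higher precision)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: a round 8B parameters

bits = effective_bits_per_weight()          # 1 + 16/128
fp16_gb = model_size_gb(NUM_PARAMS, 16)     # FP16 baseline
one_bit_gb = model_size_gb(NUM_PARAMS, bits)
ratio = fp16_gb / one_bit_gb

print(f"bits per weight: {bits}")
print(f"FP16 size:  {fp16_gb:.3f} GB")
print(f"1-bit size: {one_bit_gb:.3f} GB")
print(f"reduction:  {ratio:.1f}x")
```

Under these assumptions the scale overhead is what turns a naive 16x ratio into roughly 14x.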

The technical approach uses a novel 1-bit quantization architecture with an FP16 scale factor applied every 128 bits. PrismML provides custom forks of llama.cpp and MLX optimized for 1-bit inference, along with demo code, Colab notebooks, and developer integration documentation. The models are available on Hugging Face under the Apache 2.0 license. AnythingLLM integrated Bonsai models on launch day, an early sign of ecosystem adoption and of compatibility with existing local LLM infrastructure.
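
To make the idea of 1-bit weights with group scales concrete, here is a minimal sketch of sign quantization with one FP16 scale per group. It is not PrismML's method, which this listing does not describe in detail: using the mean absolute value as the scale is a common choice in binary-weight networks and is an assumption here, as are the function names.

```python
import struct
from typing import List, Tuple

GROUP_SIZE = 128  # one scale per 128 one-bit weights, per the description above


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def quantize_1bit(weights: List[float],
                  group_size: int = GROUP_SIZE) -> Tuple[List[int], List[float]]:
    """Return one sign bit per weight and one FP16 scale per group."""
    bits: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumed scale rule: mean absolute value of the group.
        scales.append(to_fp16(sum(abs(w) for w in group) / len(group)))
        bits.extend(1 if w >= 0 else 0 for w in group)
    return bits, scales


def dequantize_1bit(bits: List[int], scales: List[float],
                    group_size: int = GROUP_SIZE) -> List[float]:
    """Rebuild approximate weights as +scale or -scale."""
    return [(1.0 if bit else -1.0) * scales[i // group_size]
            for i, bit in enumerate(bits)]


# Tiny demo with a group size of 2 so the result is easy to inspect.
demo_bits, demo_scales = quantize_1bit([0.5, -0.25, 0.75, -1.0], group_size=2)
demo_restored = dequantize_1bit(demo_bits, demo_scales, group_size=2)
print(demo_bits, demo_scales, demo_restored)
```

Each restored weight keeps only its sign and the magnitude shared by its group, which is why storage drops to roughly one bit per weight plus a small scale overhead.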

With $16.25M in funding from Khosla Ventures, Cerberus, and Google, PrismML has the backing to develop its 1-bit quantization toolchain into a broader platform. The Bonsai 8B, 4B, and 1.7B models offer different capability-efficiency tradeoffs for different deployment scenarios. A 355-point Show HN launch on Hacker News and a positive reception on r/LocalLLaMA point to strong community interest in edge-efficient LLMs. For developers building on-device AI experiences, Bonsai offers a practical path to running capable models without cloud dependencies.

Pricing

Models are free (Apache 2.0); the company is VC-funded, so tooling may be commercialized later

Platforms

Custom llama.cpp/MLX forks; Hugging Face; runs on iPhone, desktop, edge

Alternatives

MLC LLM

Run LLMs natively on any device with ML compilation

MLC LLM is an open-source engine for deploying large language models natively across diverse platforms using machine learning compilation. It runs models on NVIDIA/AMD GPUs, Apple Silicon, mobile devices, and browsers via WebGPU without cloud dependencies. Features include an OpenAI-compatible API, quantization support, and optimized backends for CUDA, Metal, Vulkan, and WebAssembly.

Open Source

ExecuTorch

PyTorch on-device AI for mobile and edge devices

ExecuTorch is PyTorch's official solution for deploying AI models on mobile, embedded, and edge devices. It features a 50KB base runtime, 12+ hardware backends including Apple CoreML, Qualcomm QNN, ARM, and Vulkan, and native PyTorch export without format conversions. It powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, and Ray-Ban Smart Glasses, and supports LLMs, vision, speech, and multimodal models.

Open Source

Llamafile

Run LLMs as a single portable executable file

Llamafile by Mozilla packages a complete LLM — model weights, inference engine, and OpenAI-compatible API server — into a single executable file that runs on Mac, Windows, Linux, FreeBSD, and OpenBSD with no installation. Built on llama.cpp and Cosmopolitan Libc for cross-platform portability, it delivers GPU-accelerated inference when available and falls back to optimized CPU execution. Supports GGUF models with a built-in web chat UI and REST API for integration.

Open Source

Related Tools

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

Open Source

Deep Lake

AI data runtime for multimodal datasets and vector search

Deep Lake is an open-source AI data runtime from Activeloop for storing, versioning, and querying multimodal data and embeddings. It fits teams building RAG, training, evaluation, or dataset-heavy agent workflows that need a bridge between vector search, structured metadata, and large image, text, audio, or video collections.

Open Source

SeekDB

AI-native state store with hybrid vector and full-text search

SeekDB is an open-source AI-native state store from the OceanBase ecosystem that combines MySQL-compatible data access with hybrid vector and full-text retrieval. It targets agent and AI application teams that need embedded or server deployment, copy-on-write style sandboxes, and searchable state without gluing together several separate storage layers.

Open Source

CLIProxyAPI

Self-hosted proxy API for routing AI CLI accounts into OpenAI-compatible endpoints

CLIProxyAPI is an open-source Go proxy server that wraps Gemini CLI, Claude Code, OpenAI Codex, Grok Build, and related CLI account flows behind OpenAI/Gemini/Claude-compatible API endpoints. Use it carefully: it can touch OAuth sessions, auth files, logs, and provider account policies, so production use needs credential and ToS review.

Open Source, Telemetry

OpenHuman

Local-first personal AI agent with memory trees, desktop integrations, and private workspace context.

OpenHuman is an open-source, local-first personal AI agent from TinyHumans. It combines a desktop app, persistent memory trees, Obsidian-compatible storage, OAuth integrations, and local model support into a private assistant harness. It is most interesting for users who want agentic workflows and long-term memory without handing every context detail to a fully cloud-hosted assistant.

Open Source, Telemetry

DenchClaw

Local AI CRM and workflow automation on OpenClaw

DenchClaw is a local AI CRM and workflow automation app built on OpenClaw. It runs on a Mac at localhost, lets users chat with local business data, and focuses on lead enrichment, founder/customer research, and outreach automation. It belongs beside local AI, workflow automation, and OpenClaw-style personal-agent tools rather than pure coding IDEs.

Open Source
