aicoolies logo
MLC LLM logo

MLC LLM

Run LLMs natively on any device with ML compilation

Share
open-sourceOpen Source
Visit Website →

MLC LLM is an open-source engine for deploying large language models natively across diverse platforms using machine learning compilation. It runs models on NVIDIA/AMD GPUs, Apple Silicon, mobile devices, and browsers via WebGPU without cloud dependencies. Features include OpenAI-compatible API, quantization support, and optimized backends for CUDA, Metal, Vulkan, and WebAssembly.

MLC LLM uses machine learning compilation technology to deploy large language models natively on virtually any hardware platform. Rather than relying on framework-specific runtimes, it compiles models into optimized native code for the target platform using Apache TVM's compiler infrastructure. This approach enables running LLMs on NVIDIA GPUs via CUDA, AMD GPUs via ROCm/Vulkan, Apple Silicon via Metal, mobile devices via Android NDK and iOS, and even web browsers via WebGPU — all from the same model definition.

The project provides pre-compiled model libraries for popular architectures including Llama, Mistral, Gemma, Phi, and Qwen, along with tools for compiling custom models. It offers an OpenAI-compatible REST API server for drop-in replacement in existing applications, chat CLI for interactive use, and Python/JavaScript/Swift APIs for embedding in applications. Quantization support includes group quantization and mixed-precision modes to reduce memory requirements while maintaining generation quality.

MLC LLM is open-source under Apache 2.0, developed by the MLC AI community with roots in CMU research. It distinguishes itself from tools like llama.cpp by using compiler-based optimization rather than hand-tuned kernels, which enables automatic optimization for new hardware targets. The project maintains active development with regular model updates and platform support improvements, making it a strong choice for developers who need to deploy LLMs across heterogeneous hardware without maintaining separate deployment paths for each platform.

Pricing

Free and open-source (Apache 2.0)

Platforms

Python/CLI — GPU, CPU, mobile, browser via WebGPU

Categories

Tags

Use Cases

Alternatives

ExecuTorch logo

ExecuTorch

PyTorch on-device AI for mobile and edge devices

ExecuTorch is PyTorch's official solution for deploying AI models on mobile, embedded, and edge devices. It features a 50KB base runtime, 12+ hardware backends including Apple CoreML, Qualcomm QNN, ARM, and Vulkan, and native PyTorch export without format conversions. Powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, and Ray-Ban Smart Glasses, supporting LLMs, vision, speech, and multimodal models.

open-sourceOpen Source
TensorFlow Lite logo

TensorFlow Lite

Google's lightweight ML framework for mobile and embedded

TensorFlow Lite is Google's lightweight ML framework for deploying models on mobile and embedded devices. It supports quantization, GPU/NPU delegation, and runs on Android, iOS, Linux, and microcontrollers. Provides pre-trained models, model conversion tools from TensorFlow and JAX, and hardware acceleration via GPU, Hexagon DSP, and CoreML delegates. Powers on-device ML in billions of Google app installations.

open-sourceOpen Source

OpenVINO

Intel's open-source AI inference optimization toolkit

OpenVINO is Intel's open-source toolkit for optimizing and deploying AI inference across CPUs, GPUs, and NPUs. It supports models from PyTorch, TensorFlow, ONNX, and TFLite, providing graph optimizations, quantization, and hardware-specific acceleration. The toolkit includes a GenAI API for LLM deployment and runs on Intel, ARM, and x86 platforms for edge, desktop, and cloud inference workloads.

open-sourceOpen Source
ONNX Runtime logo

ONNX Runtime

Cross-platform high-performance ML inference engine

ONNX Runtime is Microsoft's open-source inference engine for machine learning models in ONNX format. It delivers cross-platform acceleration via execution providers for NVIDIA CUDA, TensorRT, DirectML, CoreML, OpenVINO, and more. Supports training acceleration, quantization, and GenAI workloads. Used in production across Windows, Azure, Office 365, and thousands of applications with pip-installable Python and native C++/C#/Java APIs.

open-sourceOpen Source

Related Tools

Claude

Claude

Top Pick

Anthropic's frontier AI assistant

Anthropic's AI assistant known for strong reasoning, nuanced writing, and extended context up to 200K tokens. Available in Opus (most capable), Sonnet (balanced), and Haiku (fast) tiers. Features web search, deep research, file analysis, code execution, artifacts, and Projects for organized workflows. Claude Code provides terminal-based agentic coding. API supports tool use, batch processing, and prompt caching. Available via claude.ai, mobile apps, and developer API.

freemium
xAI Python SDK logo

xAI Python SDK

Official Python SDK for the xAI API

The xAI Python SDK is the official Python client for the xAI API, giving developers a direct way to build Grok-powered apps without relying on community proxies or unofficial wrappers. It supports synchronous and asynchronous Python clients for chat completions, streaming responses, function/tool calling, and multimodal workflows, making it a clean fit for backend services, agents, notebooks, and developer tools that need programmatic xAI access.

open-sourceOpen Source
Cerebras logo

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium
Chatbox logo

Chatbox

One desktop app for every LLM — private, cross-platform, extensible

Chatbox is a cross-platform desktop AI client supporting OpenAI, Claude, Gemini, DeepSeek, and local models via Ollama. All chat data stays on-device, making it ideal for privacy-conscious developers. Features include document analysis, code assistance with syntax highlighting, image generation, web search, and a local knowledge base for private Q&A. Available on Windows, macOS, Linux, Android, iOS, and web.

freemiumOpen Source
Baseten logo

Baseten

ML inference platform for production AI models

Baseten is the inference platform for deploying AI models at scale with dedicated and pre-optimized model APIs and performance-optimized infrastructure. Specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. Delivers 75% latency reduction with 415ms cold starts and 3000+ concurrent scaling. Available as managed cloud or self-hosted, trusted by Cursor, Notion, Descript, and Sourcegraph for production inference.

api-usage-based
Nexa SDK logo

Nexa SDK

Cross-platform on-device AI model runtime

Nexa SDK enables running frontier LLMs and multimodal models locally across PC, mobile, IoT, and wearables with automatic hardware acceleration for GPU, NPU, and CPU. It supports Qwen, Gemma, Llama, DeepSeek models with Python/C++ desktop SDKs, Android/iOS mobile SDKs, and Docker for edge deployment. Includes an OpenAI-compatible API server with chat and function calling support.

open-sourceOpen Source