llama.cpp

High-performance local LLM inference in C/C++

open-sourceOpen Source

llama.cpp is the foundational C/C++ library with 75K+ GitHub stars powering local LLM inference on consumer hardware. Provides optimized CPU and GPU inference for quantized models in GGUF format. Supports LLaMA, Mistral, Phi, Gemma, and most open-weight families. Features 2-8 bit quantization for reduced memory, multi-GPU support, context extension, grammar-constrained output, and an OpenAI-compatible API server. The engine behind Ollama and LM Studio.

llama.cpp is the foundational library for running LLMs on consumer hardware. With 75K+ stars, it powers Ollama, LM Studio, and many local AI applications.

Optimized for CPU (AVX/AVX2/AVX-512), Apple Silicon (Metal), NVIDIA (CUDA), and AMD (ROCm). GGUF format quantization from 2-bit to 8-bit reduces memory while maintaining quality.

Supports LLaMA, Mistral, Phi, Gemma, Qwen, and virtually all open-weight models. Features context extension, grammar-constrained output, batch processing, and speculative decoding.

Built-in HTTP server provides OpenAI-compatible API for seamless local inference. Continuously optimized by a large community.

Pricing

Free and open-source

Platforms

CPU, CUDA, Metal, ROCm, any OS

Use Cases

Local AI Workflows AI Model Training

Alternatives

RunAnywhere SDK

Cross-platform on-device AI inference SDK

RunAnywhere SDK is a production-ready toolkit for running AI models entirely on-device across iOS, macOS, Android, Web, React Native, and Flutter. It provides a unified C++ core with platform-specific bindings for LLM text generation via llama.cpp, vision-language models, Whisper speech-to-text, Piper text-to-speech, and on-device image generation. All processing stays local with zero cloud dependency, ensuring privacy and low latency for mobile and edge AI applications.

open-sourceOpen Source

Nexa SDK

Cross-platform on-device AI model runtime

Nexa SDK enables running frontier LLMs and multimodal models locally across PC, mobile, IoT, and wearables with automatic hardware acceleration for GPU, NPU, and CPU. It supports Qwen, Gemma, Llama, DeepSeek models with Python/C++ desktop SDKs, Android/iOS mobile SDKs, and Docker for edge deployment. Includes an OpenAI-compatible API server with chat and function calling support.

open-sourceOpen Source

Related Tools

Claude

Top Pick

Anthropic's frontier AI assistant

Anthropic's AI assistant known for strong reasoning, nuanced writing, and extended context up to 200K tokens. Available in Opus (most capable), Sonnet (balanced), and Haiku (fast) tiers. Features web search, deep research, file analysis, code execution, artifacts, and Projects for organized workflows. Claude Code provides terminal-based agentic coding. API supports tool use, batch processing, and prompt caching. Available via claude.ai, mobile apps, and developer API.

freemium

xAI Python SDK

Official Python SDK for the xAI API

The xAI Python SDK is the official Python client for the xAI API, giving developers a direct way to build Grok-powered apps without relying on community proxies or unofficial wrappers. It supports synchronous and asynchronous Python clients for chat completions, streaming responses, function/tool calling, and multimodal workflows, making it a clean fit for backend services, agents, notebooks, and developer tools that need programmatic xAI access.

open-sourceOpen Source

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium

Chatbox

One desktop app for every LLM — private, cross-platform, extensible

Chatbox is a cross-platform desktop AI client supporting OpenAI, Claude, Gemini, DeepSeek, and local models via Ollama. All chat data stays on-device, making it ideal for privacy-conscious developers. Features include document analysis, code assistance with syntax highlighting, image generation, web search, and a local knowledge base for private Q&A. Available on Windows, macOS, Linux, Android, iOS, and web.

freemiumOpen Source

Baseten

ML inference platform for production AI models

Baseten is the inference platform for deploying AI models at scale with dedicated and pre-optimized model APIs and performance-optimized infrastructure. Specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. Delivers 75% latency reduction with 415ms cold starts and 3000+ concurrent scaling. Available as managed cloud or self-hosted, trusted by Cursor, Notion, Descript, and Sourcegraph for production inference.

api-usage-based

Triton Inference Server

NVIDIA's optimized AI model serving platform

Triton Inference Server is NVIDIA's open-source inference serving platform that deploys AI models from TensorRT, PyTorch, ONNX, TensorFlow, OpenVINO, Python, and more across cloud, data center, and edge environments. It supports dynamic batching, model ensembles, concurrent model execution on GPUs and CPUs, and real-time, streaming, and batch inference patterns. Includes Model Analyzer for profiling and Model Navigator for automated optimization.

open-sourceOpen Source

Comparisons

Ollama vs llama.cpp — Local LLM Wrapper vs the Inference Engine It Wraps

Ollama and llama.cpp both let you run open-weight models on your own hardware, but they sit at different layers of the stack. llama.cpp is the C/C++ inference engine that started the local-LLM movement and quietly powers a huge slice of the ecosystem. Ollama is the Go-based developer wrapper that hides the rough edges and turned local models into a one-line install for everyone else.

Ollamallama.cpp

llama.cpp

Pricing

Platforms

Categories

Tags

Use Cases

Alternatives

RunAnywhere SDK

Nexa SDK

Related Tools

Claude

xAI Python SDK

Cerebras

Chatbox

Baseten

Triton Inference Server

Comparisons

Ollama vs llama.cpp — Local LLM Wrapper vs the Inference Engine It Wraps