llama.cpp is a foundational C/C++ library for running LLMs on consumer hardware. With over 75,000 GitHub stars, it powers Ollama, LM Studio, and many other local AI applications.
Optimized for CPUs (AVX, AVX2, AVX-512), Apple Silicon (Metal), NVIDIA GPUs (CUDA), and AMD GPUs (ROCm). Models are distributed in the GGUF format, which supports quantization from 2-bit to 8-bit, cutting memory use while largely preserving output quality.
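The memory savings from quantization follow directly from the bits stored per weight. A back-of-envelope sketch (real GGUF files add metadata and per-block scale overhead, so actual sizes run somewhat larger):

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size in gigabytes (1 GB = 1e9 bytes):
    parameters * bits per weight / 8 bytes. Ignores quantization
    block overhead and file metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at common precisions:
for bits in (2, 4, 8, 16):
    print(f"{bits:>2}-bit: ~{estimate_size_gb(7e9, bits):.1f} GB")
# 16-bit weights need ~14 GB; 4-bit quantization brings that to ~3.5 GB,
# which is what makes 7B models practical on consumer machines.
```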
Supports LLaMA, Mistral, Phi, Gemma, Qwen, and most other open-weight model families. Features include context-length extension, grammar-constrained output, batch processing, and speculative decoding.
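The idea behind speculative decoding can be shown with a toy pure-Python sketch of the greedy variant: a cheap draft model proposes several tokens at once, and the target model keeps the longest prefix it agrees with. The "models" below are stand-in functions for illustration, not llama.cpp APIs.

```python
def speculative_step(prefix, draft_model, target_model, n_draft=4):
    """Propose n_draft tokens with the draft model, then keep only
    those the target model would also have chosen greedily."""
    # Draft phase: the cheap model extends the prefix token by token.
    proposal = []
    ctx = list(prefix)
    for _ in range(n_draft):
        tok = draft_model(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: the target model checks each proposed token in order.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        if target_model(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: take the target's own token and stop.
            accepted.append(target_model(ctx))
            break
    return accepted

# Stand-in "models": the target greedily emits a fixed phrase; the
# draft agrees on the first two tokens, then diverges.
PHRASE = ["the", "quick", "brown", "fox"]
target = lambda ctx: PHRASE[len(ctx) % len(PHRASE)]
draft = lambda ctx: PHRASE[len(ctx) % len(PHRASE)] if len(ctx) < 2 else "cat"

print(speculative_step([], draft, target))  # → ['the', 'quick', 'brown']
```

The payoff: the expensive target model verifies a whole batch of draft tokens in one pass, so when the draft is usually right, several tokens are produced per target-model invocation.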
The built-in HTTP server exposes an OpenAI-compatible API, so existing clients can point at a local endpoint without code changes. The project is continuously optimized by a large contributor community.
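Because the server speaks the OpenAI chat API, a request is just standard JSON posted to the chat completions path. A minimal sketch using only the standard library; the host/port assume a locally started `llama-server` with its usual defaults, and the model name is a placeholder (the server serves whatever model it loaded):

```python
import json

BASE_URL = "http://localhost:8080"  # assumed local server; adjust to your setup

payload = {
    "model": "local-model",  # placeholder; llama-server uses its loaded model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    "temperature": 0.7,
}

body = json.dumps(payload).encode("utf-8")

# To send (requires a running server):
#   import urllib.request
#   req = urllib.request.Request(
#       BASE_URL + "/v1/chat/completions", data=body,
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

The same payload works unchanged against any OpenAI-compatible endpoint, which is what lets existing tooling target a local llama.cpp server by swapping only the base URL.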