# local-llm

11 tools tagged

Showing 11 of 11 tools

Nexa SDK

Cross-platform on-device AI model runtime

Nexa SDK enables running frontier LLMs and multimodal models locally across PC, mobile, IoT, and wearables with automatic hardware acceleration for GPU, NPU, and CPU. It supports Qwen, Gemma, Llama, DeepSeek models with Python/C++ desktop SDKs, Android/iOS mobile SDKs, and Docker for edge deployment. Includes an OpenAI-compatible API server with chat and function calling support.

open-sourceOpen Source

Twinny

Free local AI code completion for VSCode

Twinny is a free, open-source AI code completion extension for VS Code that works with any OpenAI API-compatible endpoint including local models via Ollama. It provides real-time inline suggestions, a sidebar chat for code discussion, workspace embeddings for context-aware completions, and a decentralized P2P inference network. A privacy-first alternative to GitHub Copilot with zero licensing cost.

open-sourceOpen Source

RunAnywhere SDK

Cross-platform on-device AI inference SDK

RunAnywhere SDK is a production-ready toolkit for running AI models entirely on-device across iOS, macOS, Android, Web, React Native, and Flutter. It provides a unified C++ core with platform-specific bindings for LLM text generation via llama.cpp, vision-language models, Whisper speech-to-text, Piper text-to-speech, and on-device image generation. All processing stays local with zero cloud dependency, ensuring privacy and low latency for mobile and edge AI applications.

open-sourceOpen Source

RamaLama

Container-native local AI model serving with Podman

RamaLama is an open-source tool that containerizes AI model inference using Podman or Docker, eliminating host system configuration complexity. It auto-detects GPUs (NVIDIA, AMD, Intel, Apple Silicon), pulls models from HuggingFace, Ollama, and OCI registries, and runs them in isolated rootless containers with read-only mounts and network isolation. Developed under the Containers project (Red Hat ecosystem), it brings familiar container workflows to local LLM serving.

open-sourceOpen Source

gptme

Personal AI agent in your terminal with local tool access

gptme is one of the earliest terminal-based AI agent CLIs, launched in spring 2023. It equips a personal AI agent with local tools to run shell commands, write and edit code, browse the web via Playwright, and use vision capabilities. Supports MCP integration and an extensible plugin system for building persistent autonomous agents. The Bob reference agent is documented as having run extensively as an autonomous agent.

free

QMD

On-device hybrid search engine for your docs and notes

QMD is an on-device search engine built by Tobi Lütke (Shopify CEO) that indexes markdown notes, meeting transcripts, and documentation locally. It combines BM25 full-text search, vector semantic search, and LLM-powered re-ranking into a single hybrid pipeline. Ships with a built-in MCP server for seamless integration with Claude Code, Cursor, and other AI editors. All processing happens on your machine via node-llama-cpp with GGUF models — zero cloud dependency.

free

TaxHacker

Self-hosted AI accounting for freelancers and small teams

TaxHacker is an open-source, self-hosted AI accounting app that automatically extracts financial data from receipts, invoices, and bank statements using LLMs. It supports 170+ currencies and 14 cryptocurrencies with historical exchange rate conversion, multi-project accounting, and custom AI extraction fields. Works with OpenAI, Gemini, Mistral, or local models via Ollama—deploy with Docker and keep all financial data under your control.

open-sourceOpen Source

llama-swap

Hot-swap between local LLM models via OpenAI-compatible API

llama-swap is an open-source tool that manages multiple local LLM models behind a single OpenAI-compatible API endpoint. It automatically loads and unloads models on demand, letting developers hot-swap between different models without restarting services. With 3.1K+ GitHub stars, it solves the common pain point of running multiple specialized models on limited hardware.

open-sourceOpen Source

Llamafile

Run LLMs as a single portable executable file

Llamafile by Mozilla packages a complete LLM — model weights, inference engine, and OpenAI-compatible API server — into a single executable file that runs on Mac, Windows, Linux, FreeBSD, and OpenBSD with no installation. Built on llama.cpp and Cosmopolitan Libc for cross-platform portability, it delivers GPU-accelerated inference when available and falls back to optimized CPU execution. Supports GGUF models with a built-in web chat UI and REST API for integration.

open-sourceOpen Source

MLC LLM

Run LLMs natively on any device with ML compilation

MLC LLM is an open-source engine for deploying large language models natively across diverse platforms using machine learning compilation. It runs models on NVIDIA/AMD GPUs, Apple Silicon, mobile devices, and browsers via WebGPU without cloud dependencies. Features include OpenAI-compatible API, quantization support, and optimized backends for CUDA, Metal, Vulkan, and WebAssembly.

open-sourceOpen Source

Jan

Offline-first AI assistant for local inference

Jan is an open-source offline-first AI assistant with 25K+ GitHub stars running LLMs locally without sending data externally. Features a ChatGPT-like interface with one-click model downloads from Hugging Face, conversation management, customizable prompts, and an OpenAI-compatible local API server. Supports GGUF models via llama.cpp with GPU acceleration on NVIDIA and Apple Silicon. Built with Electron for macOS, Windows, and Linux with full data privacy.

open-sourceOpen Source