Running AI models and agents locally without cloud dependencies
YC-backed multimodal RAG platform for documents, images, and video
Morphik is a YC-backed multimodal RAG platform that ingests and retrieves information from documents, images, tables, and video content. It processes complex document layouts, including charts, diagrams, and multi-column formats, that traditional text-only RAG systems handle poorly. It provides API-first integration for building knowledge bases that understand visual as well as textual information.
Multilingual emotional text-to-speech with 80+ language support
Fish Speech is an open-source text-to-speech system supporting 80+ languages with emotional expression, zero-shot voice cloning, and real-time streaming. It generates natural speech with controllable emotions, speaking styles, and prosody. Features a web interface, API server, and integration with AI agent frameworks for voice-enabled applications. Over 29,000 GitHub stars.
Open-source voice cloning and text-to-speech with few-shot learning
GPT-SoVITS is an open-source voice cloning and text-to-speech system that generates natural-sounding speech from just a few seconds of reference audio. It combines GPT-style language modeling with SoVITS voice synthesis for zero-shot and few-shot voice cloning across multiple languages. Supports Chinese, English, Japanese, Korean, and Cantonese with over 56,000 GitHub stars.
Local model inference engine with OpenAI-compatible API and web UI
Xinference is a local inference engine that runs LLMs, embedding models, image generation, and audio models with an OpenAI-compatible API. It provides a web dashboard for model management, supports vLLM, llama.cpp, and transformers backends, and handles multi-GPU deployment automatically. Supports 100+ models including Qwen, Llama, Mistral, and DeepSeek with over 9,200 GitHub stars.
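A minimal sketch of the typical workflow, assuming Xinference is installed via pip and running on its default port 9997; the model name and launch parameters are illustrative:

```shell
# Start a local Xinference server (default port 9997)
xinference-local --host 127.0.0.1 --port 9997

# From another terminal, launch a model onto the server
xinference launch --model-name qwen2.5-instruct --size-in-billions 7 --model-format ggufv2

# Query it through the OpenAI-compatible endpoint
curl http://127.0.0.1:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the API mirrors OpenAI's, existing OpenAI client libraries can be pointed at the local base URL without code changes.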
ModelScope's fine-tuning framework supporting 600+ models
ms-swift is ModelScope's open-source framework for fine-tuning over 600 large language and multimodal models. It supports SFT, DPO, RLHF, LoRA, QLoRA, and full fine-tuning with a web UI and CLI interface. Optimized for the Chinese AI ecosystem with native ModelScope Hub integration alongside Hugging Face support. Over 13,500 GitHub stars.
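A hedged sketch of a LoRA fine-tuning run with the ms-swift CLI; the model and dataset identifiers below are illustrative examples, not requirements:

```shell
# LoRA fine-tune a chat model via supervised fine-tuning (SFT)
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset AI-ModelScope/alpaca-gpt4-data-en \
    --output_dir output
```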
All-in-one embeddings database with RAG, search, and agent capabilities
txtai is a self-contained AI search and RAG platform that combines vector embeddings, semantic search, LLM pipelines, and agent workflows in a single Python library. It handles embedding generation, similarity search, extractive QA, summarization, translation, and custom pipelines without external dependencies. It runs entirely locally, has over 12,400 GitHub stars, and is Apache 2.0 licensed.
OpenAI's open-source speech recognition model for any language
Whisper is OpenAI's open-source automatic speech recognition model trained on 680,000 hours of multilingual audio data. It supports transcription and translation across 99 languages with robust handling of accents, background noise, and technical vocabulary. Available in multiple model sizes from tiny (39M) to large (1.5B parameters) for balancing accuracy and speed.
Meta's official PyTorch library for LLM fine-tuning
torchtune is Meta's official PyTorch-native library for fine-tuning large language models. It provides composable building blocks for training recipes covering LoRA, QLoRA, full fine-tuning, DPO, and knowledge distillation. Supports Llama, Mistral, Gemma, Qwen, and Phi model families with distributed training across multiple GPUs. Designed as a hackable, dependency-minimal alternative to higher-level frameworks.
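A sketch of the recipe-driven workflow, assuming torchtune is installed and you have access to the gated Llama weights; the model and config names are examples from torchtune's built-in recipes:

```shell
# Download model weights, then run a built-in LoRA recipe on a single GPU
tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct
tune run lora_finetune_single_device --config llama3_2/1B_lora_single_device
```

Recipes are plain Python files, so a copied config can be edited directly rather than configured through framework abstractions.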
Unified framework for fine-tuning 100+ large language models
LLaMA-Factory is an open-source toolkit providing a unified interface for fine-tuning over 100 LLMs and vision-language models. It supports SFT, RLHF with PPO and DPO, LoRA and QLoRA for memory-efficient training, and continuous pre-training. The LLaMA Board web UI enables no-code configuration, while CLI and YAML workflows serve advanced users. Integrates with Hugging Face, ModelScope, vLLM, and SGLang for model deployment.
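A sketch of the YAML-driven CLI workflow, assuming LLaMA-Factory is installed from source; the example configs shown ship with the repository:

```shell
# LoRA SFT from a bundled example config, then chat with the result
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```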
Production RAG engine with hybrid search and knowledge graphs
R2R is a production-grade RAG engine from SciPhi AI that combines hybrid search with knowledge graph extraction and agentic retrieval capabilities. It provides a complete pipeline from document ingestion through retrieval and generation, supporting vector, keyword, and graph-based search strategies. The managed API and self-hosted options make it accessible for both rapid prototyping and production deployments requiring advanced retrieval beyond simple vector similarity.
RAG-based document QA with multi-user support and agent reasoning
Kotaemon is an open-source RAG-powered document question-answering interface backed by Cinnamon AI. It supports multi-user workspaces with access controls, advanced retrieval pipelines including hybrid search and knowledge graph extraction, and agentic reasoning for complex multi-step queries. The web UI handles PDFs, Office documents, and images with citations pointing to exact source passages, making it suitable for both individual research and team knowledge management.
On-device AI inference engine for mobile and wearable applications
Cactus is a YC-backed open-source inference engine built specifically for running LLMs, vision models, and embeddings on smartphones, tablets, and wearable devices. It provides native SDKs for iOS, Android, Flutter, and React Native with optimized ARM CPU and Apple NPU execution paths. Cactus claims the fastest inference speeds on ARM processors, with 10x lower RAM usage compared to generic runtimes, enabling privacy-first AI applications that run entirely on-device.
Microsoft's framework for running 1-bit large language models on consumer CPUs
BitNet is Microsoft's official inference framework for 1-bit quantized large language models that enables running models with up to 100 billion parameters on standard consumer CPUs without requiring a GPU. By leveraging extreme quantization where weights use only 1.58 bits on average, BitNet achieves dramatic reductions in memory footprint and computational cost while maintaining competitive output quality for many practical use cases.
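A hedged sketch of running a 1-bit model on CPU with bitnet.cpp; exact script names and flags may differ across versions, so treat this as the general shape rather than exact commands:

```shell
# Clone and set up bitnet.cpp, then run a 1.58-bit quantized model on CPU
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" -cnv
```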
Run frontier AI models across a cluster of everyday devices
exo turns a collection of everyday devices — laptops, desktops, phones — into a unified AI compute cluster capable of running large language models that no single device could handle alone. It automatically partitions models across available hardware using dynamic model sharding, supports heterogeneous device types including Apple Silicon, NVIDIA, and AMD GPUs, and communicates over standard networking without requiring specialized interconnects.
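A sketch of a two-device setup, assuming exo is installed from source on each machine; the model name and default port are illustrative and may vary by release:

```shell
# On each device on the same network (exo auto-discovers peers):
git clone https://github.com/exo-explore/exo.git
cd exo && pip install -e .
exo

# Any node then serves a ChatGPT-compatible API (default port 52415)
curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b", "messages": [{"role": "user", "content": "Hello"}]}'
```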
AMD's open-source local LLM server with GPU and NPU acceleration
Lemonade is AMD's open-source local AI serving platform that runs LLMs, image generation, speech recognition, and text-to-speech directly on your hardware. Built in lightweight C++, it automatically detects and configures optimal CPU, GPU, and NPU backends. Lemonade exposes an OpenAI-compatible API so existing applications work without code changes, and ships with a desktop app for model management and testing. Supports GGUF, ONNX, and SafeTensors across Windows, Linux, macOS, and Docker.
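A hedged sketch of serving and querying a model with Lemonade; the port, endpoint path, and model name below are illustrative and may differ by version:

```shell
# Start the local server, then point any OpenAI client at it
lemonade-server serve

curl http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-3.2-1B-Instruct-Hybrid", "messages": [{"role": "user", "content": "Hi"}]}'
```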
Persistent memory plugin for Claude Code with automatic context injection
Claude-Mem is a persistent memory plugin for Claude Code with 44,000+ GitHub stars that captures session context and injects it into future sessions. It features progressive disclosure with token cost visibility, automatic compression, and privacy controls with private tags to manage what gets remembered across coding sessions.
Single-file memory layer replacing complex RAG for AI agents
Memvid is an open-source single-file memory system for AI agents with 13,700+ GitHub stars. It replaces complex RAG infrastructure with instant retrieval from portable .mv2 files, claiming 35% accuracy improvement over state-of-the-art on LoCoMo benchmarks with 0.025ms P50 latency. Available for Python, Node.js, Rust, and CLI.
Enterprise-grade RAG and MCP knowledge base with one-click deployment
MaxKB is an enterprise-grade RAG platform with 21,000+ GitHub stars from the 1Panel team. It provides one-click deployment of knowledge bases with built-in LLM integration, MCP support, and a streamlined approach to document ingestion and retrieval that prioritizes operational simplicity over configuration complexity.
No-code knowledge base platform with visual AI workflow and built-in RAG
FastGPT is an open-source no-code AI knowledge base platform with 27,000+ GitHub stars and 500,000+ users worldwide. It combines visual workflow orchestration, built-in RAG pipelines, QA-pair extraction, and API-aligned completions into a single deployable stack that installs with a one-line Docker command and runs in as little as 2GB of RAM.
2x faster LLM fine-tuning with 70% less VRAM on a single GPU
Unsloth is an open-source framework for fine-tuning large language models up to 2x faster while using 70% less VRAM. Built with custom Triton kernels, it supports 500+ model architectures including Llama 4, Qwen 3, and DeepSeek on consumer NVIDIA GPUs. Unsloth Studio adds a no-code web UI for dataset creation, training observability, model comparison, and GGUF export for Ollama and vLLM deployment.
Microsoft's open-source frontier voice AI for long-form multi-speaker audio
VibeVoice is Microsoft's open-source voice AI family with both TTS and speech recognition models. The TTS model generates up to 90 minutes of expressive multi-speaker audio with 4 distinct voices. VibeVoice-ASR transcribes 60-minute recordings in a single pass with speaker identification and timestamps. Built on continuous speech tokenizers at 7.5 Hz and next-token diffusion, it compresses audio 80x more efficiently than Encodec while preserving fidelity.
One-command local coding agent that auto-detects your hardware and picks the best model
hf-agents is a Hugging Face CLI extension that detects your hardware, recommends the best GGUF model using llmfit, and launches a local coding agent in a single command. It collapses the multi-step local LLM setup into hf agents run pi, automatically handling hardware profiling, model download, inference server startup, and coding agent activation.
First commercially viable 1-bit LLMs that are 14x smaller and 8x faster
PrismML Bonsai delivers the first commercially viable 1-bit large language models with 8B, 4B, and 1.7B parameter variants. The 8B model runs in just 1GB of RAM versus 16GB for standard FP16 models, achieving 44 tokens per second on an iPhone. Backed by $16.25M from Khosla Ventures and released under Apache 2.0, Bonsai makes capable LLMs practical for edge devices and resource-constrained environments.
Find which AI models actually run on your hardware in one command
llmfit is a Rust-based terminal tool that matches over 200 LLMs from 30+ providers against your exact hardware specs. The interactive TUI scores each model on fit, speed, VRAM usage, and context length, helping you avoid downloading models that won't run on your machine. It supports Ollama, llama.cpp, MLX, Docker Model Runner, and LM Studio backends.