DeepSeek's FP8 general matrix multiplication kernels for efficient inference
DeepGEMM is DeepSeek's open-source library of FP8 matrix multiplication CUDA kernels optimized for LLM inference and training on modern NVIDIA GPUs. It provides efficient GEMM operations using 8-bit floating point precision that reduce memory bandwidth requirements while maintaining model accuracy. Designed for integration into inference engines and training frameworks. Over 6,300 GitHub stars.
DeepSeek's expert-parallel communication library for MoE model training
DeepEP is DeepSeek's open-source communication library optimized for expert-parallel training of Mixture-of-Experts models. It provides efficient GPU-to-GPU data routing for distributing tokens to expert networks across multiple devices during MoE model training and inference. Enables the distributed expert parallelism that powers DeepSeek's competitive model efficiency. Over 9,100 GitHub stars.
Multilingual emotional text-to-speech with 80+ language support
Fish Speech is an open-source text-to-speech system supporting 80+ languages with emotional expression, zero-shot voice cloning, and real-time streaming. It generates natural speech with controllable emotions, speaking styles, and prosody. Features a web interface, API server, and integration with AI agent frameworks for voice-enabled applications. Over 29,000 GitHub stars.
Open-source voice cloning and text-to-speech with few-shot learning
GPT-SoVITS is an open-source voice cloning and text-to-speech system that generates natural-sounding speech from just a few seconds of reference audio. It combines GPT-style language modeling with SoVITS voice synthesis for zero-shot and few-shot voice cloning across multiple languages. Supports Chinese, English, Japanese, Korean, and Cantonese. Over 56,000 GitHub stars.
DeepSeek's optimized attention kernel for Multi-Head Latent Attention
FlashMLA is DeepSeek's open-source CUDA kernel implementing efficient Multi-Head Latent Attention, the attention mechanism used in DeepSeek-V2 and V3 models. It provides optimized GPU kernels that significantly reduce memory usage and improve inference speed for MLA-based architectures. Represents DeepSeek's contribution to open AI infrastructure with over 12,600 GitHub stars.
ModelScope's fine-tuning framework supporting 600+ models
ms-swift is ModelScope's open-source framework for fine-tuning over 600 large language and multimodal models. It supports SFT, DPO, RLHF, LoRA, QLoRA, and full fine-tuning through both a web UI and a CLI. Optimized for the Chinese AI ecosystem with native ModelScope Hub integration alongside Hugging Face support. Over 13,500 GitHub stars.
End-to-end open-source platform for training and evaluating foundation models
Oumi is an end-to-end open-source platform for training, fine-tuning, and evaluating foundation models at any scale. It covers data preparation, distributed training, reinforcement learning from human feedback, evaluation benchmarks, and model deployment in a unified framework. Spans the full lifecycle from pre-training from scratch to post-training alignment. Over 9,100 GitHub stars.
Multi-LoRA inference server for serving hundreds of fine-tuned models
LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It dynamically loads and unloads LoRA adapters on demand, sharing the base model's GPU memory across all adapters. Built on text-generation-inference with an OpenAI-compatible API. Enables multi-tenant model serving without per-model GPU allocation. Over 3,700 GitHub stars.
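Because LoRAX exposes an OpenAI-compatible endpoint, a standard client can target a specific adapter just by naming it in the model field. A minimal sketch, assuming a LoRAX server on localhost:8080; the adapter name is hypothetical:

    from openai import OpenAI

    # Point the standard OpenAI client at the LoRAX server (address assumed).
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="acme/support-lora",  # hypothetical adapter, loaded on demand
        messages=[{"role": "user", "content": "Summarize my last ticket."}],
    )
    print(resp.choices[0].message.content)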
Meta's official PyTorch library for LLM fine-tuning
torchtune is Meta's official PyTorch-native library for fine-tuning large language models. It provides composable building blocks for training recipes covering LoRA, QLoRA, full fine-tuning, DPO, and knowledge distillation. Supports Llama, Mistral, Gemma, Qwen, and Phi model families with distributed training across multiple GPUs. Designed as a hackable, dependency-minimal alternative to higher-level frameworks.
Serverless GPU compute platform for AI inference and training
Modal is a serverless compute platform that lets developers run AI workloads on GPUs with a Python-first SDK. Functions deploy with simple decorators, auto-scale from zero to thousands of containers, and bill per second of actual use. Supports LLM inference, fine-tuning, batch processing, and sandboxed environments. Used by Meta, Scale AI, and Harvey. Valued at $1.1B after an $87M Series B.
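The decorator workflow looks roughly like this; a minimal sketch, assuming Modal's App API and an A10G GPU request, launched with "modal run app.py":

    import modal

    app = modal.App("gpu-demo")

    @app.function(gpu="A10G")  # request a GPU; the function scales to zero when idle
    def square(x: int) -> int:
        return x * x

    @app.local_entrypoint()
    def main():
        # .remote() executes the function in Modal's cloud rather than locally
        print(square.remote(12))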
Distributed AI compute engine for scaling Python and ML workloads
Ray is an open-source distributed computing framework built for scaling AI and Python applications from a laptop to thousands of GPUs. It provides libraries for distributed training, hyperparameter tuning, model serving, reinforcement learning, and data processing under a single unified API. Used by OpenAI for ChatGPT training, Uber, Shopify, and Instacart. Maintained by Anyscale and part of the PyTorch Foundation.
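Ray's core primitive turns plain Python functions into distributed tasks; a minimal sketch of the task API:

    import ray

    ray.init()  # starts a local cluster, or connects to an existing one

    @ray.remote
    def square(x):
        return x * x

    # Each call schedules a task on the cluster; ray.get gathers results.
    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]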
Unified framework for fine-tuning 100+ large language models
LLaMA-Factory is an open-source toolkit providing a unified interface for fine-tuning over 100 LLMs and vision-language models. It supports SFT, RLHF with PPO and DPO, LoRA and QLoRA for memory-efficient training, and continuous pre-training. The LLaMA Board web UI enables no-code configuration, while CLI and YAML workflows serve advanced users. Integrates with Hugging Face, ModelScope, vLLM, and SGLang for model deployment.
Open-source control plane for AI workloads across multi-cloud GPU infrastructure
dstack is an open-source platform that orchestrates AI training and inference workloads across heterogeneous GPU infrastructure spanning multiple clouds, Kubernetes clusters, and bare-metal servers. It abstracts away cloud-specific APIs so teams define GPU requirements declaratively and dstack automatically provisions the cheapest available resources from AWS, GCP, Azure, Lambda, or on-premises hardware.
Run frontier AI models across a cluster of everyday devices
exo turns a collection of everyday devices — laptops, desktops, phones — into a unified AI compute cluster capable of running large language models that no single device could handle alone. It automatically partitions models across available hardware using dynamic model sharding, supports heterogeneous device types including Apple Silicon, NVIDIA, and AMD GPUs, and communicates over standard networking without requiring specialized interconnects.
AMD's open-source local LLM server with GPU and NPU acceleration
Lemonade is AMD's open-source local AI serving platform that runs LLMs, image generation, speech recognition, and text-to-speech directly on your hardware. Built as a lightweight C++ server, it automatically detects and configures the optimal CPU, GPU, and NPU backends. Lemonade exposes an OpenAI-compatible API so existing applications work without code changes, and ships with a desktop app for model management and testing. Supports GGUF, ONNX, and SafeTensors across Windows, Linux, macOS, and Docker.
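Because the API is OpenAI-compatible, switching an existing app to Lemonade is mostly a base-URL change. A minimal sketch; the address, path, and model name are assumptions to adapt to your install:

    from openai import OpenAI

    # Base URL and model name are assumptions; match them to your Lemonade setup.
    client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")

    resp = client.chat.completions.create(
        model="Llama-3.2-1B-Instruct-Hybrid",  # hypothetical locally served model
        messages=[{"role": "user", "content": "Hello from the NPU"}],
    )
    print(resp.choices[0].message.content)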
2x faster LLM fine-tuning with 70% less VRAM on a single GPU
Unsloth is an open-source framework for fine-tuning large language models up to 2x faster while using 70% less VRAM. Built with custom Triton kernels, it supports 500+ model architectures including Llama 4, Qwen 3, and DeepSeek on consumer NVIDIA GPUs. Unsloth Studio adds a no-code web UI for dataset creation, training observability, model comparison, and GGUF export for Ollama and vLLM deployment.
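In code, the usual entry point is FastLanguageModel: load a checkpoint in 4-bit, then attach LoRA adapters so only the low-rank weights are trained. A minimal sketch, assuming the example checkpoint name:

    from unsloth import FastLanguageModel

    # Example checkpoint; any Unsloth-supported model name works here.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",
        max_seq_length=2048,
        load_in_4bit=True,  # QLoRA-style 4-bit quantization to cut VRAM
    )

    # Attach LoRA adapters; only these low-rank matrices receive gradients.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

The resulting model drops into a standard Hugging Face TRL SFTTrainer loop.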
Kubernetes-native distributed LLM inference stack
llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.
Fast, collaborative code editor built in Rust
Zed is a lightweight, cross-platform code editor built in Rust for extreme speed. It opens instantly, handles large files without lag, and supports LSP for language intelligence, Vim and Emacs keybindings, and real-time multiplayer collaboration. With GPU-accelerated rendering and a growing extension ecosystem, Zed targets developers who prioritize responsiveness above all else.
The fast, feature-rich terminal
kitty is a GPU-accelerated terminal emulator written in C and Python, focused on performance and features. Supports ligatures, true color, a graphics protocol for displaying images and plots inline, tabs, splits, and remote control via IPC. Highly configurable through a plain-text config file, and cross-platform on macOS and Linux. Its "kitten" framework lets users write terminal programs in Python. Known for innovation in terminal graphics, with 26K+ GitHub stars and a dedicated power-user community.
GPU-accelerated terminal with Lua config
WezTerm is a GPU-accelerated, cross-platform terminal emulator written in Rust and configured in Lua for maximum flexibility. Supports multiplexing (splits, tabs, workspaces), ligatures, true color, the Sixel, iTerm2, and Kitty image protocols, and a built-in SSH multiplexer for remote sessions. Offers extensive keyboard and mouse customization, dynamic color schemes, and a serial port mode. Works on macOS, Linux, Windows, and FreeBSD. Known for deep customizability, with 19K+ GitHub stars.
A fast, cross-platform terminal
Alacritty, self-described as the "fastest terminal emulator in existence", is a GPU-accelerated, cross-platform terminal written in Rust and focused on simplicity and performance. No tabs, splits, or built-in multiplexer: it is designed to pair with tmux or Zellij. Configured via TOML (YAML in older releases) with a minimal feature set that prioritizes speed above all else. Supports true color, Vi mode, regex search, and clickable URLs. Available on macOS, Linux, Windows, and BSD. 57K+ GitHub stars.