# gpu-accelerated
25 tools tagged
Showing 24 of 25 tools
Baseten
ML inference platform for production AI models
Baseten is an inference platform for deploying AI models at scale with dedicated and pre-optimized model APIs and performance-optimized infrastructure. It specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. It reports a 75% latency reduction, 415 ms cold starts, and scaling to 3,000+ concurrency. Available as managed cloud or self-hosted, it is trusted by Cursor, Notion, Descript, and Sourcegraph for production inference.
Sonic
ByteDance high-performance JSON library
Sonic is ByteDance's blazingly fast JSON serialization library accelerated by JIT compilation and SIMD instructions. It achieves 3x faster throughput than Go's standard library while using 75% less memory and 99% fewer allocations. Drop-in compatible with encoding/json, it handles both simple Marshal/Unmarshal operations and streaming APIs for high-throughput services processing millions of events.
Triton Inference Server
NVIDIA's optimized AI model serving platform
Triton Inference Server is NVIDIA's open-source inference serving platform that deploys AI models from TensorRT, PyTorch, ONNX, TensorFlow, OpenVINO, Python, and more across cloud, data center, and edge environments. It supports dynamic batching, model ensembles, concurrent model execution on GPUs and CPUs, and real-time, streaming, and batch inference patterns. Includes Model Analyzer for profiling and Model Navigator for automated optimization.
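To illustrate the dynamic batching idea mentioned above, here is a minimal sketch in plain Python. It is not Triton's implementation; the function `form_batches` and its parameters `max_batch_size` and `max_delay` are hypothetical names chosen for this example. A batch is flushed when it is full or when the next request arrives too long after the batch's first request.

```python
def form_batches(arrivals, max_batch_size=4, max_delay=0.005):
    """Group (arrival_time_seconds, request) pairs into batches.

    A batch is closed when it already holds max_batch_size requests, or when
    the next request arrives more than max_delay seconds after the first
    request in the current batch.
    """
    batches = []
    current = []  # list of (arrival_time, request) for the open batch
    for t, req in sorted(arrivals):
        full = len(current) == max_batch_size
        too_late = bool(current) and t - current[0][0] > max_delay
        if current and (full or too_late):
            batches.append([r for _, r in current])
            current = []
        current.append((t, req))
    if current:
        batches.append([r for _, r in current])
    return batches


arrivals = [(0.000, "a"), (0.001, "b"), (0.002, "c"), (0.020, "d")]
print(form_batches(arrivals, max_batch_size=2, max_delay=0.005))
```

The trade-off this sketch exposes is the usual one: a larger delay window yields fuller batches and better accelerator utilization, at the cost of added latency for the first request in each batch.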
FlashAttention
Fast memory-efficient GPU attention kernels
FlashAttention is a fast and memory-efficient exact attention implementation that reduces GPU memory usage from quadratic to linear in sequence length. Created by Tri Dao, it achieves 3-4x speedups over baseline implementations through IO-aware tiling that minimizes HBM reads and writes. Versions include FlashAttention-2 with improved parallelism, FlashAttention-3 optimized for Hopper H100 GPUs, and FlashAttention-4 targeting Hopper and Blackwell architectures.
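The tiling idea can be sketched in plain Python for a single query vector. This is an illustrative toy, not the FlashAttention CUDA kernels: it shows only the running-max and running-sum bookkeeping (often called online softmax) that lets attention be computed block by block without storing all scores at once, while still giving the exact result. The function names and `block_size` are assumptions for this example, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score first, then a standard softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values so far
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

Only one block of scores exists at any time in `tiled_attention`, which is the property that keeps extra memory linear rather than quadratic in sequence length when the same scheme is applied across all queries.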
RamaLama
Container-native local AI model serving with Podman
RamaLama is an open-source tool that containerizes AI model inference using Podman or Docker, eliminating host system configuration complexity. It auto-detects GPUs (NVIDIA, AMD, Intel, Apple Silicon), pulls models from HuggingFace, Ollama, and OCI registries, and runs them in isolated rootless containers with read-only mounts and network isolation. Developed under the Containers project (Red Hat ecosystem), it brings familiar container workflows to local LLM serving.
DeepGEMM
DeepSeek's FP8 general matrix multiplication kernels for efficient inference
DeepGEMM is DeepSeek's open-source library of FP8 matrix multiplication CUDA kernels optimized for LLM inference and training on modern NVIDIA GPUs. It provides efficient GEMM operations using 8-bit floating point precision that reduce memory bandwidth requirements while maintaining model accuracy. Designed for integration into inference engines and training frameworks. Over 6,300 GitHub stars.
DeepEP
DeepSeek's expert-parallel communication library for MoE model training
DeepEP is DeepSeek's open-source communication library optimized for expert-parallel training of Mixture-of-Experts models. It provides efficient GPU-to-GPU data routing for distributing tokens to expert networks across multiple devices during MoE model training and inference. Enables the distributed expert parallelism that powers DeepSeek's competitive model efficiency. Over 9,100 GitHub stars.
Fish Speech
Multilingual emotional text-to-speech with 80+ language support
Fish Speech is an open-source text-to-speech system supporting 80+ languages with emotional expression, zero-shot voice cloning, and real-time streaming. It generates natural speech with controllable emotions, speaking styles, and prosody. Features a web interface, API server, and integration with AI agent frameworks for voice-enabled applications. Over 29,000 GitHub stars.
GPT-SoVITS
Open-source voice cloning and text-to-speech with few-shot learning
GPT-SoVITS is an open-source voice cloning and text-to-speech system that generates natural-sounding speech from just a few seconds of reference audio. It combines GPT-style language modeling with SoVITS voice synthesis for zero-shot and few-shot voice cloning across multiple languages. Supports Chinese, English, Japanese, Korean, and Cantonese. Over 56,000 GitHub stars.
FlashMLA
DeepSeek's optimized attention kernel for Multi-Head Latent Attention
FlashMLA is DeepSeek's MIT-licensed CUDA kernel library for optimized attention in DeepSeek-V3 and DeepSeek-V3.2-Exp style inference. It includes dense MLA decoding plus sparse attention kernels for DeepSeek Sparse Attention, with README-reported H800/CUDA metrics of up to 3000 GB/s and 660 TFLOPS, and sparse paths at 640/410 TFLOPS. Over 12,700 GitHub stars.
ms-swift
ModelScope's fine-tuning framework supporting 600+ models
ms-swift is ModelScope's open-source framework for fine-tuning over 600 large language and multimodal models. It supports SFT, DPO, RLHF, LoRA, QLoRA, and full fine-tuning through a web UI and a CLI. Optimized for the Chinese AI ecosystem with native ModelScope Hub integration alongside Hugging Face support. Over 13,500 GitHub stars.
Oumi
End-to-end open-source platform for training and evaluating foundation models
Oumi is an end-to-end open-source platform for training, fine-tuning, and evaluating foundation models at any scale. It covers data preparation, distributed training, reinforcement learning from human feedback, evaluation benchmarks, and model deployment in a unified framework. Supports workflows ranging from training from scratch to post-training alignment. Over 9,100 GitHub stars.
LoRAX
Multi-LoRA inference server for serving hundreds of fine-tuned models
LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It dynamically loads and unloads LoRA adapters on demand, sharing the base model's GPU memory across all adapters. Built on text-generation-inference with an OpenAI-compatible API. Enables multi-tenant model serving without per-model GPU allocation. Over 3,700 GitHub stars.
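The on-demand loading and unloading of adapters can be pictured as a bounded cache sitting in front of one shared base model. The sketch below is a hypothetical illustration, not LoRAX code: the `AdapterCache` class, its `load_fn` callback, and the least-recently-used eviction policy are all assumptions made for this example.

```python
from collections import OrderedDict


class AdapterCache:
    """Keep at most `capacity` adapters resident; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # hypothetical loader: adapter_id -> weights
        self._loaded = OrderedDict()  # adapter_id -> weights, oldest first

    def get(self, adapter_id):
        if adapter_id in self._loaded:
            # Cache hit: mark as most recently used.
            self._loaded.move_to_end(adapter_id)
            return self._loaded[adapter_id]
        if len(self._loaded) >= self.capacity:
            # Cache full: unload the least recently used adapter.
            self._loaded.popitem(last=False)
        weights = self.load_fn(adapter_id)
        self._loaded[adapter_id] = weights
        return weights

    def resident(self):
        return list(self._loaded)


cache = AdapterCache(capacity=2, load_fn=lambda name: f"weights:{name}")
for request_adapter in ["tenant-a", "tenant-b", "tenant-a", "tenant-c"]:
    cache.get(request_adapter)
print(cache.resident())
```

Because adapters are small relative to the base model, a cache like this lets many tenants share one deployment while only the currently active adapters occupy accelerator memory.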
torchtune
Meta's official PyTorch library for LLM fine-tuning
torchtune is Meta's official PyTorch-native library for fine-tuning large language models. It provides composable building blocks for training recipes covering LoRA, QLoRA, full fine-tuning, DPO, and knowledge distillation. Supports Llama, Mistral, Gemma, Qwen, and Phi model families with distributed training across multiple GPUs. Designed as a hackable, dependency-minimal alternative to higher-level frameworks.
Modal
Serverless GPU compute platform for AI inference and training
Modal is a serverless compute platform that lets developers run AI workloads on GPUs with a Python-first SDK. Functions deploy with decorators, auto-scale from zero to thousands of containers, and bill per second. It supports LLM inference, fine-tuning, batch jobs, and sandboxes, with current GPU options including B200, H200, H100, A100, L40S, A10, L4, and T4. Modal’s 2026 Series C valued the company at $4.65B.
Ray
Distributed AI compute engine for scaling Python and ML workloads
Ray is an open-source distributed computing framework built for scaling AI and Python applications from a laptop to thousands of GPUs. It provides libraries for distributed training, hyperparameter tuning, model serving, reinforcement learning, and data processing under a single unified API. Ray's public site highlights OpenAI and other enterprise users. Maintained by Anyscale with Apache-2.0 open-source licensing.
LLaMA-Factory
Unified framework for fine-tuning 100+ large language models
LLaMA-Factory is an open-source toolkit providing a unified interface for fine-tuning over 100 LLMs and vision-language models. It supports SFT, RLHF with PPO and DPO, LoRA and QLoRA for memory-efficient training, and continuous pre-training. The LLaMA Board web UI enables no-code configuration, while CLI and YAML workflows serve advanced users. Integrates with Hugging Face, ModelScope, vLLM, and SGLang for model deployment.
dstack
Open-source control plane for AI workloads across multi-cloud GPU infrastructure
dstack is an open-source platform that orchestrates AI training and inference workloads across heterogeneous GPU infrastructure spanning multiple clouds, Kubernetes clusters, and bare-metal servers. It abstracts away cloud-specific APIs so teams define GPU requirements declaratively and dstack automatically provisions the cheapest available resources from AWS, GCP, Azure, Lambda, or on-premises hardware.
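The "declare requirements, get the cheapest match" behavior described above can be sketched as a filter followed by a minimum. This is a hypothetical illustration and not dstack's API or configuration format: the `cheapest_offer` function, the dictionary keys, the backend names, and the prices are all made up for the example.

```python
def cheapest_offer(requirements, offers):
    """Return the lowest-priced offer that satisfies the requirements, or None."""
    matching = [
        offer for offer in offers
        if offer["gpu"] == requirements["gpu"]
        and offer["gpu_count"] >= requirements["gpu_count"]
    ]
    if not matching:
        return None
    return min(matching, key=lambda offer: offer["price_per_hour"])


# Illustrative data only; backends and prices are placeholders.
offers = [
    {"backend": "cloud-a", "gpu": "H100", "gpu_count": 1, "price_per_hour": 3.0},
    {"backend": "cloud-b", "gpu": "H100", "gpu_count": 1, "price_per_hour": 2.5},
    {"backend": "on-prem", "gpu": "A100", "gpu_count": 8, "price_per_hour": 0.0},
]
requirements = {"gpu": "H100", "gpu_count": 1}
print(cheapest_offer(requirements, offers))
```

The caller states only what the workload needs; which backend ends up serving it is decided by the selection step.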
exo
Run frontier AI models across a cluster of everyday devices
exo turns multiple local machines into a unified AI compute cluster for models that exceed a single device's memory. It automatically discovers devices, uses topology-aware auto parallelism to split work across available resources, and supports RDMA over Thunderbolt 5 for co-located clusters or standard networking for looser setups. The project exposes OpenAI Chat Completions, Claude Messages, OpenAI Responses, and Ollama-compatible APIs plus a dashboard for cluster management.
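One simple way to picture splitting a model across devices is to assign contiguous layer ranges in proportion to each device's memory. The sketch below is a deliberate simplification and not exo's algorithm: it ignores network topology entirely, and the `partition_layers` function and device names are hypothetical.

```python
def partition_layers(num_layers, device_memory_gb):
    """Assign contiguous half-open layer ranges [start, end) by memory share."""
    assignments = {}
    start = 0
    remaining_layers = num_layers
    remaining_memory = sum(device_memory_gb.values())
    for name, memory in device_memory_gb.items():
        # Share of the layers still unassigned, by share of memory still unassigned.
        count = round(remaining_layers * memory / remaining_memory)
        assignments[name] = (start, start + count)
        start += count
        remaining_layers -= count
        remaining_memory -= memory
    return assignments


print(partition_layers(32, {"laptop": 16, "desktop": 8, "mini-pc": 8}))
```

Computing each share against what remains, rather than against the original totals, guarantees that the last device absorbs any rounding leftovers and every layer is assigned exactly once.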
Lemonade
AMD's open-source local LLM server with GPU and NPU acceleration
Lemonade is AMD's open-source local AI serving platform for LLMs, image generation, speech recognition, and text-to-speech on your own hardware. Built in lightweight C++, it can detect CPU, GPU, and NPU backends and is further optimized for Ryzen AI, Radeon, and Strix Halo PCs. Lemonade exposes OpenAI, Anthropic, and Ollama-compatible APIs, ships with a desktop model manager, and supports GGUF, FLM, and ONNX models across Windows, Linux, macOS, and Docker.
Unsloth
2x faster LLM fine-tuning with 70% less VRAM on a single GPU
Unsloth is an open-source framework for fine-tuning large language models up to 2x faster while using 70% less VRAM. Built with custom Triton kernels, it supports 500+ model architectures including Llama 4, Qwen 3, and DeepSeek on consumer NVIDIA GPUs. Unsloth Studio adds a no-code web UI for dataset creation, training observability, model comparison, and GGUF export for Ollama and vLLM deployment.
llm-d
Kubernetes-native distributed LLM inference stack
llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.
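Cache-aware routing can be sketched as choosing the node whose cached token sequences share the longest prefix with the incoming prompt. The code below is an illustrative toy, not llm-d's scheduler: `pick_node`, the data layout, and the tie-breaking by load and then node name are assumptions made for this example.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_node(prompt_tokens, node_caches, node_load):
    """Prefer the longest warm prefix; break ties by lower load, then name."""
    def score(node):
        best = max(
            (shared_prefix_len(prompt_tokens, cached) for cached in node_caches[node]),
            default=0,
        )
        return (-best, node_load[node], node)
    return min(node_caches, key=score)


node_caches = {
    "node-1": [[1, 2, 3]],
    "node-2": [[1, 2, 3, 4, 5]],
    "node-3": [],
}
node_load = {"node-1": 0, "node-2": 3, "node-3": 0}
print(pick_node([1, 2, 3, 4, 9], node_caches, node_load))
```

A longer shared prefix means more of the prompt's KV cache is already warm on that node, so less prefill work has to be repeated; a production router would also weigh load more carefully than this tie-break does.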
kitty
The fast, feature-rich terminal
kitty is a GPU-accelerated terminal emulator written in C and Python, focused on performance and features. It supports ligatures, true color, a graphics protocol for displaying images and plots inline, tabs, splits, and remote control via IPC. It is highly configurable through a plain text config file, runs on macOS and Linux, and includes a kitten framework for writing terminal programs in Python. Known for innovation in terminal graphics. 26K+ GitHub stars and a dedicated power-user community.
WezTerm
GPU-accelerated terminal with Lua config
WezTerm is a GPU-accelerated cross-platform terminal emulator written in Rust and configured in Lua. It supports multiplexing (splits, tabs, workspaces), ligatures, true color, sixel/iTerm2/Kitty image protocols, and an SSH multiplexer for remote sessions. It offers extensive keyboard and mouse customization, dynamic color schemes, and a built-in serial port mode. Works on macOS, Linux, Windows, and FreeBSD. Known for deep customizability. 19K+ GitHub stars.