Tools tagged "GPU Accelerated" — aicoolies

# gpu-accelerated

25 tools tagged

Showing 24 of 25 tools

Baseten

ML inference platform for production AI models

Baseten is the inference platform for deploying AI models at scale with dedicated and pre-optimized model APIs and performance-optimized infrastructure. Specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. Delivers 75% latency reduction with 415ms cold starts and 3000+ concurrent scaling. Available as managed cloud or self-hosted, trusted by Cursor, Notion, Descript, and Sourcegraph for production inference.

api-usage-based

Sonic

ByteDance high-performance JSON library

Sonic is ByteDance's blazingly fast JSON serialization library accelerated by JIT compilation and SIMD instructions. It achieves 3x faster throughput than Go's standard library while using 75% less memory and 99% fewer allocations. Drop-in compatible with encoding/json, it handles both simple Marshal/Unmarshal operations and streaming APIs for high-throughput services processing millions of events.

Open Source

Triton Inference Server

NVIDIA's optimized AI model serving platform

Triton Inference Server is NVIDIA's open-source inference serving platform that deploys AI models from TensorRT, PyTorch, ONNX, TensorFlow, OpenVINO, Python, and more across cloud, data center, and edge environments. It supports dynamic batching, model ensembles, concurrent model execution on GPUs and CPUs, and real-time, streaming, and batch inference patterns. Includes Model Analyzer for profiling and Model Navigator for automated optimization.

Open Source
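The dynamic batching mentioned in the Triton Inference Server entry can be illustrated with a small sketch. This is not Triton's API or its scheduler: the function name `dynamic_batches`, its parameters, and the integer millisecond timestamps are all assumptions made for illustration. The idea shown is only that queued requests are grouped into one batch until the batch is full or the oldest request in it has waited too long.

```python
def dynamic_batches(arrivals, max_batch_size, max_delay_ms):
    """Group requests into batches (hypothetical helper, not Triton code).

    arrivals: list of (request_id, arrival_time_ms), sorted by arrival time.
    A batch is closed when it is full, or when a new request arrives more
    than max_delay_ms after the first request of the open batch.
    """
    batches = []
    current = []
    window_start = None
    for request_id, arrival_ms in arrivals:
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and arrival_ms - window_start > max_delay_ms
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        if not current:
            # This request opens a new batch and starts its delay window.
            window_start = arrival_ms
        current.append(request_id)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [("a", 0), ("b", 1), ("c", 2), ("d", 10), ("e", 11)]
    print(dynamic_batches(arrivals, max_batch_size=2, max_delay_ms=5))
```

A larger batch size generally improves accelerator utilization, while a shorter delay bounds the extra latency a request can pay while waiting for batch mates.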

FlashAttention

Fast memory-efficient GPU attention kernels

FlashAttention is a fast and memory-efficient exact attention implementation that reduces GPU memory usage from quadratic to linear in sequence length. Created by Tri Dao, it achieves 3-4x speedups over baseline implementations through IO-aware tiling that minimizes HBM reads and writes. Versions include FlashAttention-2 with improved parallelism, FlashAttention-3 optimized for Hopper H100 GPUs, and FlashAttention-4 targeting Hopper and Blackwell architectures.

Open Source
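The tiling idea behind the FlashAttention entry can be sketched in plain Python for a single query vector. This is an illustration of the block-wise "online softmax" computation only, under the assumption that keys and values are processed in fixed-size blocks; it does not model GPU memory (HBM versus on-chip SRAM) and is not the library's kernel. The function names are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(tiled_attention(q, keys, values, block_size=2))
```

Because the running maximum and running sum are carried across blocks, the full row of attention scores never has to be stored, which is where the linear memory footprint in sequence length comes from.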

RamaLama

Container-native local AI model serving with Podman

RamaLama is an open-source tool that containerizes AI model inference using Podman or Docker, eliminating host system configuration complexity. It auto-detects GPUs (NVIDIA, AMD, Intel, Apple Silicon), pulls models from HuggingFace, Ollama, and OCI registries, and runs them in isolated rootless containers with read-only mounts and network isolation. Developed under the Containers project (Red Hat ecosystem), it brings familiar container workflows to local LLM serving.

Open Source

DeepGEMM

DeepSeek's FP8 general matrix multiplication kernels for efficient inference

DeepGEMM is DeepSeek's open-source library of FP8 matrix multiplication CUDA kernels optimized for LLM inference and training on modern NVIDIA GPUs. It provides efficient GEMM operations using 8-bit floating point precision that reduce memory bandwidth requirements while maintaining model accuracy. Designed for integration into inference engines and training frameworks. Over 6,300 GitHub stars.

Open Source

DeepEP

DeepSeek's expert-parallel communication library for MoE model training

DeepEP is DeepSeek's open-source communication library optimized for expert-parallel training of Mixture-of-Experts models. It provides efficient GPU-to-GPU data routing for distributing tokens to expert networks across multiple devices during MoE model training and inference. Enables the distributed expert parallelism that powers DeepSeek's competitive model efficiency. Over 9,100 GitHub stars.

Open Source

Fish Speech

Multilingual emotional text-to-speech with 80+ language support

Fish Speech is an open-source text-to-speech system supporting 80+ languages with emotional expression, zero-shot voice cloning, and real-time streaming. It generates natural speech with controllable emotions, speaking styles, and prosody. Features a web interface, API server, and integration with AI agent frameworks for voice-enabled applications. Over 29,000 GitHub stars.

Open Source

GPT-SoVITS

Open-source voice cloning and text-to-speech with few-shot learning

GPT-SoVITS is an open-source voice cloning and text-to-speech system that generates natural-sounding speech from just a few seconds of reference audio. It combines GPT-style language modeling with SoVITS voice synthesis for zero-shot and few-shot voice cloning across multiple languages. Supports Chinese, English, Japanese, Korean, and Cantonese. Over 56,000 GitHub stars.

Open Source

FlashMLA

DeepSeek's optimized attention kernel for Multi-Head Latent Attention

FlashMLA is DeepSeek's open-source CUDA kernel implementing efficient Multi-Head Latent Attention, the attention mechanism used in DeepSeek-V2 and V3 models. It provides optimized GPU kernels that significantly reduce memory usage and improve inference speed for MLA-based architectures. Part of DeepSeek's contribution to open AI infrastructure. Over 12,600 GitHub stars.

Open Source

ms-swift

ModelScope's fine-tuning framework supporting 600+ models

ms-swift is ModelScope's open-source framework for fine-tuning over 600 large language and multimodal models. It supports SFT, DPO, RLHF, LoRA, QLoRA, and full fine-tuning through a web UI and a CLI. Optimized for the Chinese AI ecosystem with native ModelScope Hub integration alongside Hugging Face support. Over 13,500 GitHub stars.

Open Source

Oumi

End-to-end open-source platform for training and evaluating foundation models

Oumi is an end-to-end open-source platform for training, fine-tuning, and evaluating foundation models at any scale. It covers data preparation, distributed training, reinforcement learning from human feedback, evaluation benchmarks, and model deployment in a unified framework. Supports the full path from training from scratch to post-training alignment. Over 9,100 GitHub stars.

Open Source

LoRAX

Multi-LoRA inference server for serving hundreds of fine-tuned models

LoRAX is an inference server that serves hundreds of fine-tuned LoRA models from a single base model deployment. It dynamically loads and unloads LoRA adapters on demand, sharing the base model's GPU memory across all adapters. Built on text-generation-inference with an OpenAI-compatible API, it enables multi-tenant model serving without per-model GPU allocation. Over 3,700 GitHub stars.

Open Source
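The adapter-sharing idea in the LoRAX entry can be sketched with a toy linear layer. This is not LoRAX code or its API: the class `MultiLoraLayer` and its methods are hypothetical, and plain Python lists stand in for GPU tensors. The sketch assumes the standard LoRA formulation, in which each adapter adds a low-rank update `B @ (A @ x)` to the output of one shared base weight.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class MultiLoraLayer:
    """One shared base weight plus small per-adapter low-rank matrices."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # stored once, shared by all adapters
        self.adapters = {}              # adapter name -> (A, B)

    def load_adapter(self, name, a, b):
        # A has shape (rank, in_dim); B has shape (out_dim, rank).
        self.adapters[name] = (a, b)

    def unload_adapter(self, name):
        self.adapters.pop(name, None)

    def forward(self, x, adapter=None):
        out = matvec(self.base_weight, x)
        if adapter is not None:
            a, b = self.adapters[adapter]
            delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
            out = [o + d for o, d in zip(out, delta)]
        return out


if __name__ == "__main__":
    layer = MultiLoraLayer(base_weight=[[1.0, 0.0], [0.0, 1.0]])
    layer.load_adapter("tenant-a", a=[[1.0, 1.0]], b=[[2.0], [0.0]])
    print(layer.forward([3.0, 4.0]))                      # base model only
    print(layer.forward([3.0, 4.0], adapter="tenant-a"))  # base plus adapter
```

Each adapter costs only `rank * (in_dim + out_dim)` parameters per layer, so many adapters can sit next to a single copy of the base weights.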

torchtune

Meta's official PyTorch library for LLM fine-tuning

torchtune is Meta's official PyTorch-native library for fine-tuning large language models. It provides composable building blocks for training recipes covering LoRA, QLoRA, full fine-tuning, DPO, and knowledge distillation. Supports Llama, Mistral, Gemma, Qwen, and Phi model families with distributed training across multiple GPUs. Designed as a hackable, dependency-minimal alternative to higher-level frameworks.

Open Source

Modal

Serverless GPU compute platform for AI inference and training

Modal is a serverless compute platform that lets developers run AI workloads on GPUs with a Python-first SDK. Functions deploy with simple decorators, auto-scale from zero to thousands of containers, and bill per second of actual use. Supports LLM inference, fine-tuning, batch processing, and sandboxed environments. Used by Meta, Scale AI, and Harvey. Valued at $1.1B after an $87M Series B.

Ray

Distributed AI compute engine for scaling Python and ML workloads

Ray is an open-source distributed computing framework built for scaling AI and Python applications from a laptop to thousands of GPUs. It provides libraries for distributed training, hyperparameter tuning, model serving, reinforcement learning, and data processing under a single unified API. Used by OpenAI (for ChatGPT training), Uber, Shopify, and Instacart. Maintained by Anyscale and part of the PyTorch Foundation.

Open Source

LLaMA-Factory

Unified framework for fine-tuning 100+ large language models

LLaMA-Factory is an open-source toolkit providing a unified interface for fine-tuning over 100 LLMs and vision-language models. It supports SFT, RLHF with PPO and DPO, LoRA and QLoRA for memory-efficient training, and continuous pre-training. The LLaMA Board web UI enables no-code configuration, while CLI and YAML workflows serve advanced users. Integrates with Hugging Face, ModelScope, vLLM, and SGLang for model deployment.

Open Source

dstack

Open-source control plane for AI workloads across multi-cloud GPU infrastructure

dstack is an open-source platform that orchestrates AI training and inference workloads across heterogeneous GPU infrastructure spanning multiple clouds, Kubernetes clusters, and bare-metal servers. It abstracts away cloud-specific APIs so teams define GPU requirements declaratively and dstack automatically provisions the cheapest available resources from AWS, GCP, Azure, Lambda, or on-premises hardware.

Open Source

exo

Run frontier AI models across a cluster of everyday devices

exo turns a collection of everyday devices — laptops, desktops, phones — into a unified AI compute cluster capable of running large language models that no single device could handle alone. It automatically partitions models across available hardware using dynamic model sharding, supports heterogeneous device types including Apple Silicon, NVIDIA, and AMD GPUs, and communicates over standard networking without requiring specialized interconnects.

Open Source
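The exo entry describes partitioning a model across heterogeneous devices. One simple way to picture this is to give each device a contiguous range of layers in proportion to its memory. The sketch below is an assumption made for illustration, not exo's actual sharding strategy or API: the function `partition_layers`, the device names, and the memory figures are all hypothetical.

```python
def partition_layers(num_layers, device_memory_gb):
    """Assign contiguous layer ranges proportional to each device's memory.

    device_memory_gb: dict of device name -> memory in GB (insertion order
    is the pipeline order). Returns name -> (start, end), end exclusive.
    """
    total_memory = sum(device_memory_gb.values())
    names = list(device_memory_gb)
    ranges = {}
    start = 0
    cumulative = 0
    for index, name in enumerate(names):
        cumulative += device_memory_gb[name]
        if index == len(names) - 1:
            end = num_layers  # last device takes whatever remains
        else:
            end = (num_layers * cumulative) // total_memory
        ranges[name] = (start, end)
        start = end
    return ranges


if __name__ == "__main__":
    devices = {"laptop": 16, "desktop": 32, "phone": 16}  # made-up values
    print(partition_layers(32, devices))
```

With this layout, each device only needs to hold its own slice of the weights and passes activations to the next device over the network.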

Lemonade

AMD's open-source local LLM server with GPU and NPU acceleration

Lemonade is AMD's open-source local AI serving platform that runs LLMs, image generation, speech recognition, and text-to-speech directly on your hardware. Built in lightweight C++, it automatically detects and configures optimal CPU, GPU, and NPU backends. Lemonade exposes an OpenAI-compatible API so existing applications work without code changes, and ships with a desktop app for model management and testing. Supports GGUF, ONNX, and SafeTensors across Windows, Linux, macOS, and Docker.

Open Source

Unsloth

2x faster LLM fine-tuning with 70% less VRAM on a single GPU

Unsloth is an open-source framework for fine-tuning large language models up to 2x faster while using 70% less VRAM. Built with custom Triton kernels, it supports 500+ model architectures including Llama 4, Qwen 3, and DeepSeek on consumer NVIDIA GPUs. Unsloth Studio adds a no-code web UI for dataset creation, training observability, model comparison, and GGUF export for Ollama and vLLM deployment.

Open Source

llm-d

Kubernetes-native distributed LLM inference stack

llm-d is an open-source Kubernetes-native stack for distributed LLM inference with cache-aware routing and disaggregated serving. It separates prefill and decode stages across different GPU pools for optimal resource utilization, routes requests to nodes with warm KV caches, and integrates with vLLM as the serving engine. Apache-2.0 licensed with 2,900+ GitHub stars.

Open Source
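The cache-aware routing described in the llm-d entry can be sketched as a scoring function over candidate nodes. This is not llm-d's scheduler or API: the function `pick_node`, the node dictionary layout, and the tie-breaking rule (prefer lower load, then name) are assumptions for illustration. The idea shown is that a request goes to the node whose cached token prefixes overlap most with the incoming prompt, so less of the prompt has to be prefilled again.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def pick_node(prompt_tokens, nodes):
    """Choose the node with the longest cached prefix for this prompt.

    nodes: dict of name -> {"cached": [token lists], "load": int}.
    Ties are broken by lower load, then by name for determinism.
    """
    def sort_key(name):
        info = nodes[name]
        best_overlap = max(
            (shared_prefix_len(prompt_tokens, cached) for cached in info["cached"]),
            default=0,
        )
        return (-best_overlap, info["load"], name)

    return min(nodes, key=sort_key)


if __name__ == "__main__":
    nodes = {
        "node-a": {"cached": [[1, 2, 3]], "load": 5},
        "node-b": {"cached": [[1, 2, 9]], "load": 0},
        "node-c": {"cached": [], "load": 0},
    }
    print(pick_node([1, 2, 3, 4], nodes))  # longest warm prefix wins
    print(pick_node([7, 8], nodes))        # no overlap: falls back to load
```

A production router would also weigh queue depth and cache eviction, but the prefix-overlap score captures why warm KV caches matter for routing.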

kitty

The fast, feature-rich terminal

GPU-accelerated terminal emulator written in C and Python, focused on performance and features. Supports ligatures, true color, graphics protocol for displaying images/plots inline, tabs, splits, and remote control via IPC. Highly configurable via a plain text config file. Cross-platform on macOS and Linux. Features a kitten framework for writing terminal programs in Python. Known for innovation in terminal graphics. 26K+ GitHub stars and a dedicated power-user community.

Open Source

WezTerm

GPU-accelerated terminal with Lua config

GPU-accelerated cross-platform terminal emulator written in Rust with configuration in Lua for maximum flexibility. Supports multiplexing (splits, tabs, workspaces), ligatures, true color, sixel/iTerm2/Kitty image protocols, and an SSH multiplexer for remote sessions. Extensive keyboard/mouse customization, dynamic color schemes, and a built-in serial port mode. Works on macOS, Linux, Windows, and FreeBSD. Known for deep customizability. 19K+ GitHub stars.

Open Source