Dstack

Open-source control plane for AI workloads across multi-cloud GPU infrastructure

Open Source

dstack is an open-source platform that orchestrates AI training and inference workloads across heterogeneous GPU infrastructure spanning multiple clouds, Kubernetes clusters, and bare-metal servers. It abstracts away cloud-specific APIs so teams define GPU requirements declaratively and dstack automatically provisions the cheapest available resources from AWS, GCP, Azure, Lambda, or on-premises hardware.

dstack is a control plane for AI infrastructure that solves the operational complexity of running training and inference workloads across diverse GPU environments. Modern AI teams face a fragmented landscape where GPU availability, pricing, and APIs differ across every cloud provider and on-premises setup. dstack provides a single declarative interface where developers specify what they need (GPU type, count, memory, and framework), and the platform handles provisioning, scheduling, and lifecycle management across all configured backends.
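The idea of declaring requirements and letting the platform pick the cheapest matching resource can be sketched in a few lines of Python. This is an illustrative model only, not dstack's code or API: the `Offer` and `Requirements` types, the `cheapest_offer` function, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """One provisionable instance type on one backend (hypothetical shape)."""
    backend: str
    gpu: str
    gpu_count: int
    gpu_memory_gb: int
    price_per_hour: float


@dataclass(frozen=True)
class Requirements:
    """What a developer declares: GPU type, count, and minimum memory per GPU."""
    gpu: str
    gpu_count: int
    min_gpu_memory_gb: int


def cheapest_offer(req: Requirements, offers: list[Offer]) -> Optional[Offer]:
    """Return the lowest-priced offer that satisfies the requirements, or None."""
    matching = [
        o for o in offers
        if o.gpu == req.gpu
        and o.gpu_count >= req.gpu_count
        and o.gpu_memory_gb >= req.min_gpu_memory_gb
    ]
    return min(matching, key=lambda o: o.price_per_hour, default=None)


# Illustrative prices only; they are not real quotes from any provider.
offers = [
    Offer("aws", "A100", 8, 40, 32.0),
    Offer("gcp", "A100", 8, 80, 30.0),
    Offer("lambda", "A100", 8, 80, 14.0),
    Offer("azure", "H100", 8, 80, 50.0),
]
best = cheapest_offer(Requirements("A100", 8, 80), offers)
print(best.backend if best else "no matching offer")
```

A real control plane would also weigh availability, region, and spot versus on-demand pricing; the sketch keeps only the filter-then-minimize step.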

The platform supports NVIDIA, AMD, and Google TPU accelerators across AWS, GCP, Azure, Lambda Cloud, and self-managed Kubernetes or bare-metal clusters. Workloads are defined in YAML configuration files that specify resource requirements, Docker images, and execution commands. dstack's fleet management automatically discovers available GPUs, tracks utilization, and schedules jobs to minimize cost and maximize throughput. The auto-scaling engine provisions and deprovisions cloud instances based on queue depth.
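Scaling on queue depth can be illustrated with a small Python sketch. It assumes a simple policy (one instance per fixed number of queued jobs, clamped to a minimum and maximum); the function name, parameters, and policy are hypothetical and are not taken from dstack's implementation.

```python
import math


def desired_instances(queue_depth: int, jobs_per_instance: int,
                      min_instances: int = 0, max_instances: int = 10) -> int:
    """Return how many instances a queue of this depth calls for.

    Assumed policy: enough instances to hold every queued job,
    never fewer than min_instances and never more than max_instances.
    """
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    needed = math.ceil(queue_depth / jobs_per_instance)
    return max(min_instances, min(max_instances, needed))


# An empty queue scales down to the minimum; a long queue is capped at the maximum.
for depth in (0, 9, 100):
    print(depth, desired_instances(depth, jobs_per_instance=4))
```

Comparing the returned value with the current instance count tells a scheduler whether to provision or deprovision; production systems typically add cooldown periods to avoid thrashing.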

dstack has raised venture funding and maintains an active open-source project with over 2,000 GitHub stars. The MPL-2.0 license allows commercial use while requiring modifications to the core to be shared. For AI teams that have outgrown the workflow of manually SSH-ing into GPU instances or navigating cloud console UIs, dstack provides the infrastructure abstraction layer that makes multi-cloud GPU orchestration as straightforward as container orchestration with Kubernetes.

Pricing

Free open-source core; commercial managed offering

Platforms

Multi-cloud (AWS, GCP, Azure, Lambda), Kubernetes, bare metal

Alternatives

Daytona

Open-source dev environment management with AI integration

Daytona is secure, elastic infrastructure for running AI-generated code in isolated sandboxes. It gives agents and developer workflows programmable environments with dedicated kernel, filesystem, network, vCPU, memory, and disk, backed by OCI/Docker compatibility, SDK/API access, and under-90ms sandbox startup. The project has 72,000+ GitHub stars and is AGPL-3.0 licensed.

Open Source

Railway

Infrastructure, instantly

Modern cloud platform for deploying full-stack apps, databases, and workers with instant provisioning and usage-based pricing. Deploy from GitHub or CLI with zero config for Node.js, Python, Go, Rust, and Docker. Built-in PostgreSQL, MySQL, Redis, and MongoDB with auto backups. Features private networking, environment management, cron jobs, TCP proxying, and real-time logs. Popular with indie hackers and startups for fast MVPs with a generous free trial including $5 monthly credits.

freemium

Coolify

Self-hosted Heroku/Vercel alternative

Open-source, self-hostable PaaS alternative to Heroku, Vercel, and Netlify with 44K+ GitHub stars. Deploy static sites, APIs, full-stack apps, databases, and 280+ one-click services on your own VPS or bare metal via SSH. Features auto Let's Encrypt SSL, Git integration (GitHub/GitLab/Bitbucket/Gitea), S3 backups, Docker Swarm support, and a REST API for CI/CD automation. Self-hosted version is free forever with no features behind paywalls.

Open Source

DeepInfra

Cost-effective AI inference platform with 86+ models from $0.02/M tokens

DeepInfra is an AI inference platform offering 86+ LLM models with pricing starting at $0.02 per million tokens. Backed by $20.6M in funding including an $18M Series A from Felicis Ventures, it provides OpenAI-compatible endpoints for models including DeepSeek, Llama, and Mistral with pay-as-you-go pricing.

api-usage-based

Related Tools

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

Open Source

Freestyle

Sandboxes for coding agents — Linux VMs, Git, and deploys in one box

Freestyle is YC-backed sandbox infrastructure built for AI coding agents, shipping secure Linux VMs with nested virtualization, Git servers, and one-click web deploys. It lets agents run real workloads, branch repos, and deploy apps under short-lived identities while billing only for active compute. Used in production by vly.ai, Rork, and Vibeflow.

freemium

OpenSRE

Open-source toolkit for building AI SRE incident response agents

OpenSRE is Tracer Cloud’s open-source public-alpha Python toolkit for building AI SRE agents that investigate and respond to production incidents. It ships 60+ tools across observability, databases, incident management, communications, deployment and protocol integrations, plus simulation/evaluation workflows for benchmarking agent accuracy before live pager use.

Open Source

Twill AI

Autonomous coding agents that ship while you sleep

Twill is an autonomous coding agent platform that implements features, fixes bugs, and ships pull requests without manual intervention. It uses a structured workflow of research, planning, human review, implementation in an isolated sandbox, AI code review, and then merge. It supports custom agent configurations with multiple LLM providers, isolated dev environments for verification, and integrations with GitHub, Linear, Sentry, Notion, and cloud platforms for end-to-end engineering automation.

freemium

Baseten

ML inference platform for production AI models

Baseten is the inference platform for deploying AI models at scale with dedicated and pre-optimized model APIs and performance-optimized infrastructure. Specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. Delivers 75% latency reduction with 415ms cold starts and 3000+ concurrent scaling. Available as managed cloud or self-hosted, trusted by Cursor, Notion, Descript, and Sourcegraph for production inference.

api-usage-based

Resolve AI

AI-powered production incident resolution

Resolve AI automates production incident investigation, diagnosis, and remediation, acting as an AI SRE that participates in every on-call rotation. It autonomously investigates incidents by pursuing multiple hypotheses in parallel, validates them against real evidence, creates code snippets and drafts PRs, generates post-mortems, and onboards new teammates with instant answers about code and infrastructure. It claims 5x faster MTTR and 87% faster incident investigations.

paid