
K8sGPT

AI-powered Kubernetes diagnostics in plain English

Open Source

K8sGPT is a CNCF Sandbox project that scans Kubernetes clusters, diagnoses issues, and explains problems in plain English with actionable remediation steps. It codifies SRE expertise into built-in analyzers for Pods, Services, Deployments, Ingress, PVCs, CronJobs, and more. K8sGPT connects to AI backends including OpenAI, Azure OpenAI, Google Gemini, Amazon Bedrock, Cohere, and local models via Ollama, with data anonymization to protect sensitive cluster information.

K8sGPT was introduced in spring 2023 and accepted into the CNCF Sandbox in December of the same year. Written in Go, it works as either a standalone CLI binary or a Kubernetes operator that runs continuously inside the cluster. The CLI approach is straightforward: run k8sgpt analyze --explain and the tool scans the cluster, collects diagnostic data from resource statuses and events, sends anonymized context to the configured AI backend, and returns explanations with specific kubectl commands to resolve each issue. Without the --explain flag, K8sGPT still provides structured diagnostic output using its internal analyzers — essentially codified SRE playbooks — without making any AI calls at all.
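The CLI flow described above can be scripted. The following is a minimal sketch, not an official client: it assumes `k8sgpt` is installed and on the PATH, that the `--output json`, `--filter`, and `--namespace` flags are available alongside `--explain`, and that the JSON report has a `results` list whose entries carry `kind`, `name`, and an `error` list with a `Text` field. The helper names and the sample report are hypothetical and exist only for illustration.

```python
import json
import subprocess


def build_analyze_command(explain=False, filters=(), namespace=None):
    """Build the argument list for a `k8sgpt analyze` run with JSON output."""
    cmd = ["k8sgpt", "analyze", "--output", "json"]
    if explain:
        # Only with --explain does the tool call the configured AI backend.
        cmd.append("--explain")
    for resource_kind in filters:
        cmd += ["--filter", resource_kind]
    if namespace:
        cmd += ["--namespace", namespace]
    return cmd


def run_analyze(**options):
    """Run k8sgpt (must be installed and configured) and return its stdout."""
    completed = subprocess.run(
        build_analyze_command(**options),
        check=True, capture_output=True, text=True,
    )
    return completed.stdout


def summarize(report_json):
    """Return (kind, name, first_error_text) for each result in a report."""
    report = json.loads(report_json)
    rows = []
    for result in report.get("results") or []:
        errors = result.get("error") or []
        first = errors[0].get("Text", "") if errors else ""
        rows.append((result.get("kind", ""), result.get("name", ""), first))
    return rows


# Hypothetical report in the assumed shape; not captured from a real cluster.
SAMPLE_REPORT = json.dumps({
    "status": "ProblemDetected",
    "problems": 1,
    "results": [
        {
            "kind": "Pod",
            "name": "default/web-0",
            "error": [{"Text": "Back-off pulling image nginx:typo"}],
            "details": "",
        }
    ],
})

if __name__ == "__main__":
    print(" ".join(build_analyze_command(explain=True, filters=["Pod"])))
    for row in summarize(SAMPLE_REPORT):
        print(row)
```

Calling `summarize(run_analyze(filters=["Pod"]))` would run the analyzers without any AI calls, while passing `explain=True` adds the backend explanations to the `details` field under the same assumptions.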

The built-in analyzers cover core Kubernetes resources: Pods, Deployments, ReplicaSets, StatefulSets, Services, Ingress, PersistentVolumeClaims, CronJobs, and Nodes. Beyond these defaults, K8sGPT integrates with Trivy for security vulnerability scanning across container images in the cluster, and with AWS Controllers for Kubernetes to analyze AWS resources managed via CRDs. The AI backend options are broad — OpenAI, Azure OpenAI, Google Gemini and Vertex AI, Amazon Bedrock and SageMaker, Cohere, Hugging Face, IBM watsonx.ai, and local models through Ollama or LocalAI for air-gapped environments where no data can leave the network.
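To make the idea of an analyzer as a codified playbook concrete, here is a conceptual sketch of a rule-based Pod check. It is not K8sGPT's implementation (which is written in Go); the `Finding` type, the `analyze_pod` function, and the chosen set of waiting reasons are assumptions made for illustration. The only thing taken from Kubernetes itself is the shape of the Pod status fields (`containerStatuses[].state.waiting` and `conditions[]`).

```python
from dataclasses import dataclass

# Waiting reasons that usually signal a fault rather than normal startup.
# This set is an illustrative assumption, not K8sGPT's actual rule list.
PROBLEM_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerConfigError",
}


@dataclass
class Finding:
    kind: str
    name: str
    text: str


def analyze_pod(pod):
    """Inspect one Pod object (as a dict) and return a list of findings."""
    meta = pod.get("metadata", {})
    name = f"{meta.get('namespace', 'default')}/{meta.get('name', '')}"
    status = pod.get("status", {})
    findings = []

    # Rule 1: a container stuck waiting for a known-bad reason.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in PROBLEM_REASONS:
            message = waiting.get("message") or waiting["reason"]
            findings.append(Finding("Pod", name, f"container {cs.get('name')}: {message}"))

    # Rule 2: the scheduler could not place the Pod on any node.
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            findings.append(Finding("Pod", name, cond.get("message") or "Unschedulable"))

    return findings


if __name__ == "__main__":
    crashing = {
        "metadata": {"namespace": "default", "name": "web-0"},
        "status": {
            "containerStatuses": [
                {"name": "web", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]
        },
    }
    for finding in analyze_pod(crashing):
        print(finding)
```

Checks like these need no AI backend at all, which is why structured findings can be produced even when explanations are turned off; the AI step only rewrites each finding into prose.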

When deployed as a Kubernetes operator, K8sGPT runs in the background and writes analysis results to custom Result resources, enabling integration with existing monitoring stacks like Prometheus and Alertmanager for automated alerting on detected issues. The operator mode suits production environments that need continuous cluster health monitoring rather than ad-hoc troubleshooting. Installation is available through Homebrew, apt/apk packages, Windows binaries, and Helm charts for the operator. The latest release is v0.4.31, and the sister project Sympozium extends the concept to managing AI agents within Kubernetes clusters.
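As a rough illustration of how the operator's Result resources could feed a monitoring stack, the sketch below counts results by resource kind and renders them as Prometheus text-format lines. It assumes the input is the JSON list produced by something like `kubectl get results -n <namespace> -o json`, and that each item stores the analyzed resource kind under `spec.kind`; both are assumptions about the operator's CRD. The metric name `k8sgpt_example_results` is hypothetical and is not a metric the operator itself exports.

```python
import json
from collections import Counter


def count_results_by_kind(result_list_json):
    """Count Result resources per analyzed kind from a kubectl JSON list."""
    doc = json.loads(result_list_json)
    return Counter(
        item.get("spec", {}).get("kind", "Unknown")
        for item in doc.get("items", [])
    )


def to_exposition(counts, metric="k8sgpt_example_results"):
    """Render counts as Prometheus text exposition lines (hypothetical metric)."""
    lines = [f"# TYPE {metric} gauge"]
    for kind in sorted(counts):
        lines.append(f'{metric}{{kind="{kind}"}} {counts[kind]}')
    return "\n".join(lines) + "\n"


# Hypothetical list in the assumed shape; not captured from a real cluster.
SAMPLE_LIST = json.dumps({
    "items": [
        {"spec": {"kind": "Pod", "name": "default/web-0"}},
        {"spec": {"kind": "Pod", "name": "default/web-1"}},
        {"spec": {"kind": "Service", "name": "default/web"}},
    ]
})

if __name__ == "__main__":
    print(to_exposition(count_results_by_kind(SAMPLE_LIST)), end="")
```

An alerting rule could then fire whenever such a gauge stays above zero for a given kind, though the exact wiring depends on how the operator and Prometheus are deployed.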

Pricing

Free and open-source under Apache 2.0, CNCF Sandbox

Platforms

CLI (Go binary), Kubernetes operator, Helm chart, multi-OS

Alternatives

Steel

Open-source browser infrastructure for AI agents at scale

Steel is an open-source browser API purpose-built for AI agents, providing managed headless browser sessions with anti-bot bypass, proxy rotation, CAPTCHA solving, and session persistence. It handles the infrastructure layer that browser automation agents like Browser Use and Stagehand run on top of. It can be self-hosted or used as a cloud service, and has over 6,000 GitHub stars.

Open Source

Trigger.dev

Open-source background jobs and AI workflows for TypeScript

Trigger.dev is an open-source platform for building and deploying background jobs, AI agents, and long-running workflows in TypeScript. It eliminates serverless timeouts with durable task execution, automatic retries, queue-based concurrency control, and elastic scaling. Used by 30,000+ developers at companies like MagicSchool and Icon.com, it processes hundreds of millions of agent runs monthly. It is backed by a $16M Series A led by Dalton Caldwell's Standard Capital fund.

Freemium, Open Source

Dokploy

Open-source PaaS alternative to Vercel, Heroku, and Netlify

Dokploy is a free open-source platform-as-a-service for self-hosting applications without cloud vendor lock-in. It provides automated deployments from Git repositories, built-in SSL certificates, database provisioning, Docker and Docker Compose support, and a clean web dashboard for managing multiple applications on your own servers. With 18,000+ GitHub stars, it fills the gap for teams wanting Vercel-like deployment simplicity on their own infrastructure.

Open Source

kubectl-ai

Google’s open-source Kubernetes assistant that translates natural-language intent into precise cluster operations.

kubectl-ai is an AI-powered Kubernetes assistant from Google Cloud Platform. It acts as an intelligent interface for cluster work, translating operator intent into Kubernetes commands and workflows. The key distinction from reactive diagnosis tools is that kubectl-ai is designed as an interactive natural-language interface for planning and executing Kubernetes operations, with provider configuration and MCP-oriented workflows around the CLI.

Open Source, Telemetry

Related Tools

KubeAI

Kubernetes operator for serving AI inference workloads

KubeAI is an Apache-2.0 Kubernetes operator for deploying and scaling AI inference workloads, including LLMs, embeddings, reranking, and speech-to-text. It gives platform teams OpenAI-compatible endpoints, model proxy/controller primitives, model caching, scale-from-zero behavior, and cluster-native resource management for self-hosted inference on Kubernetes.

Open Source

Vald

Cloud-native distributed vector search engine built for Kubernetes with automatic indexing and horizontal scaling.

Vald is a highly scalable distributed approximate nearest neighbor (ANN) vector search engine designed for cloud-native, Kubernetes-based architectures. Maintained by LY Corporation and listed in the CNCF Landscape, it uses the NGT algorithm (developed at Yahoo Japan), supports automatic incremental index backup, and handles billion-scale datasets across loosely coupled microservice components that scale horizontally via Helm.

Open Source

Freestyle

Sandboxes for coding agents — Linux VMs, Git, and deploys in one box

Freestyle is YC-backed sandbox infrastructure built for AI coding agents, shipping secure Linux VMs with nested virtualization, Git servers, and one-click web deploys. It lets agents run real workloads, branch repos, and deploy apps under short-lived identities while billing only for active compute. Used in production by vly.ai, Rork, and Vibeflow.

Freemium

OpenSRE

Open-source toolkit for building AI SRE incident response agents

OpenSRE is Tracer Cloud’s open-source public-alpha Python toolkit for building AI SRE agents that investigate and respond to production incidents. It ships 60+ tools across observability, databases, incident management, communications, deployment and protocol integrations, plus simulation/evaluation workflows for benchmarking agent accuracy before live pager use.

Open Source

Twill AI

Autonomous coding agents that ship while you sleep

Twill is an autonomous coding agent platform that implements features, fixes bugs, and ships pull requests without manual intervention. It uses a structured workflow: research, planning, human review, implementation in an isolated sandbox, AI code review, and then merge. It supports custom agent configurations with multiple LLM providers, isolated dev environments for verification, and integrations with GitHub, Linear, Sentry, Notion, and cloud platforms for end-to-end engineering automation.

Freemium

Baseten

ML inference platform for production AI models

Baseten is an inference platform for deploying AI models at scale, with dedicated and pre-optimized model APIs and performance-optimized infrastructure. It specializes in image generation, transcription, text-to-speech, LLM serving, embeddings, and compound AI workloads. It advertises a 75% latency reduction, 415ms cold starts, and concurrency scaling to 3,000+. It is available as managed cloud or self-hosted, and is used by Cursor, Notion, Descript, and Sourcegraph for production inference.

API usage-based