Using AI tools to identify, diagnose, and fix bugs — from automated error analysis to intelligent stack trace interpretation and root cause detection
Open-source AIOps alert management platform
Keep is an open-source AIOps platform that provides a single pane of glass for all alerts from monitoring tools like Datadog, PagerDuty, Grafana, and 50+ integrations. It uses AI to correlate, deduplicate, and enrich alerts, reducing noise and helping on-call teams focus on real incidents. Keep includes workflow automation, bidirectional sync with ticketing systems, and a modern web dashboard.
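As a rough illustration of what alert deduplication involves, the sketch below groups alerts that share a fingerprint. It is not Keep's implementation or API: the `fingerprint` and `deduplicate` helpers, the alert fields (`service`, `name`, `source`), and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def fingerprint(alert, keys=("service", "name")):
    """Hash the fields that identify "the same problem" (hypothetical field names)."""
    raw = "|".join(str(alert.get(k, "")) for k in keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(alerts):
    """Collapse alerts with the same fingerprint into one entry with a count."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[fingerprint(alert)].append(alert)
    return [
        {
            "alert": items[0],  # keep the first alert as the representative
            "count": len(items),
            "sources": sorted({a.get("source", "unknown") for a in items}),
        }
        for items in groups.values()
    ]
```

Fingerprinting on a small set of identifying fields, rather than the full payload, is what lets the same incident reported by two monitoring tools collapse into a single entry.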
Full-stack monitoring with session replay and tracing
Highlight.io is an open-source full-stack monitoring platform that combines session replay, error monitoring, logging, and distributed tracing in a single tool. It captures user sessions as replayable videos alongside frontend errors, backend logs, and traces, letting teams see exactly what users experienced when issues occur. It is self-hostable via Docker, and its managed cloud service offers a generous free tier.
Extremely fast Python type checker written in Rust
ty is an extremely fast Python type checker built in Rust by Astral, the team behind Ruff and uv. It performs full type inference, supports PEP 695 type parameter syntax, and checks Python code orders of magnitude faster than mypy or pyright. ty completes the Astral Python toolchain alongside Ruff for linting and uv for package management, giving developers a unified Rust-powered development experience.
Run GitHub Actions locally for fast feedback
Act is an open-source tool that runs GitHub Actions workflows locally using Docker containers that match GitHub's execution environment. It provides instant feedback on workflow changes without pushing to a repository, supports matrix builds, secret management, and artifact handling. Act can also replace Makefiles by using workflow files as task definitions, making it useful for both CI/CD development and local task automation across development teams.
High-cardinality observability platform for debugging production systems
Honeycomb is an observability platform built for high-cardinality data analysis that enables debugging of complex distributed systems. It provides interactive query exploration, distributed tracing visualization, and SLO monitoring without requiring pre-defined dashboards or metrics aggregation. The company has raised over $115M to build query-driven observability in which any dimension can be used to slice production data.
CNCF Sandbox Kubernetes alert enrichment and automation platform
Robusta is a CNCF Sandbox project that enriches Kubernetes alerts with diagnostic context and automates remediation workflows. It intercepts Prometheus alerts and attaches relevant logs, pod status, resource metrics, and troubleshooting suggestions before delivering them to Slack, Teams, or PagerDuty. It also supports custom playbooks for automated incident response and AI-powered root cause analysis.
AI-powered SRE agent for Kubernetes troubleshooting
Metoro is an AI SRE platform for Kubernetes that combines observability with autonomous troubleshooting. Its Guardian agent monitors cluster health, correlates metrics, logs, and traces to identify root causes, and suggests remediation actions. It also provides an MCP server for integration with AI coding agents and supports natural language querying of infrastructure state.
Zero-instrumentation Kubernetes observability powered by eBPF
Coroot is an open-source observability platform that uses eBPF to automatically instrument Kubernetes applications without code changes. It provides application maps, latency analysis, log correlation, and continuous profiling with automatic anomaly detection. Instead of manual instrumentation, it relies on agents that capture metrics, traces, and logs at the kernel level.
Microsoft's screen parsing model for GUI agent interaction
OmniParser is Microsoft's open-source screen parsing toolkit that converts GUI screenshots into structured, actionable data for AI agents. It detects interactive UI elements like buttons, input fields, and icons, then generates grounded descriptions that enable language models to interact with any desktop or web application. It has accumulated over 24,000 GitHub stars as a foundational layer for computer-use agents.
Developer-focused APM for Ruby, Elixir, Node.js, and Python
AppSignal is an application performance monitoring platform for Ruby, Elixir, Node.js, Python, and frontend JavaScript. It combines error tracking, performance monitoring, host metrics, anomaly detection, and uptime checks in a single dashboard with per-request pricing. The Netherlands-based company serves over 10,000 developers with an interface designed for clarity over enterprise complexity.
OpenTelemetry-native observability for LLM applications with evals and GPU monitoring
OpenLIT is an open-source AI engineering platform that provides OpenTelemetry-native observability for LLM applications. It combines distributed tracing, evaluation, prompt management, a secrets vault, and GPU telemetry in a single self-hostable stack. With 50+ integrations across LLM providers and frameworks, it lets teams monitor AI applications using their existing observability backends like Grafana, Datadog, or Jaeger.
Open-source observability platform unifying logs, traces, and session replays
HyperDX is an open-source observability platform that correlates session replays, logs, metrics, traces, and errors in a single interface powered by ClickHouse and OpenTelemetry. Acquired by ClickHouse in 2025, it now forms the visualization layer of ClickStack. It offers schema-agnostic querying on any ClickHouse cluster, intuitive full-text and property search syntax, and blazing-fast analytics. Available as a self-hosted Docker deployment or a managed cloud service with a free tier.
Bayesian git bisection for finding commits that caused flaky tests
Git Bayesect applies Bayesian inference to git bisection to find commits that introduced non-deterministic bugs such as flaky tests. Unlike standard git bisect, which requires binary pass-fail results, Git Bayesect handles probabilistic outcomes where a test sometimes passes and sometimes fails, using entropy minimization to efficiently narrow down the culprit commit.
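To make the idea of entropy-minimizing bisection concrete, here is a minimal sketch of one way it can work. It is not Git Bayesect's code: the failure rates `P_FAIL_BAD` and `P_FAIL_GOOD`, the uniform prior, and all function names are assumptions chosen for illustration. Each candidate culprit index carries a probability, a test run updates those probabilities with Bayes' rule, and the next commit to test is the one whose result is expected to leave the least uncertainty.

```python
import math

# Assumed failure rates (illustrative values, not taken from Git Bayesect).
P_FAIL_BAD = 0.5    # chance the flaky test fails at or after the culprit commit
P_FAIL_GOOD = 0.01  # chance it fails before the culprit (unrelated noise)


def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def likelihood(culprit, tested, failed):
    """P(observed result | the first bad commit is `culprit`)."""
    p_fail = P_FAIL_BAD if tested >= culprit else P_FAIL_GOOD
    return p_fail if failed else 1.0 - p_fail


def update(posterior, tested, failed):
    """Bayes update of the culprit distribution after one test run."""
    weights = [p * likelihood(c, tested, failed) for c, p in enumerate(posterior)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_entropy(posterior, tested):
    """Entropy expected after testing commit `tested`, averaged over outcomes."""
    expected = 0.0
    for failed in (True, False):
        p_outcome = sum(
            p * likelihood(c, tested, failed) for c, p in enumerate(posterior)
        )
        if p_outcome > 0:
            expected += p_outcome * entropy(update(posterior, tested, failed))
    return expected


def next_commit(posterior):
    """Pick the commit whose test result is expected to be most informative."""
    return min(range(len(posterior)), key=lambda i: expected_entropy(posterior, i))


prior = [1 / 8] * 8                               # 8 commits, uniform prior
posterior = update(prior, tested=3, failed=True)  # one failure seen at commit 3
```

Because a single pass at a flaky commit is weak evidence, the same commit may be selected repeatedly until the posterior concentrates, which is the main behavioral difference from a binary bisect.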
AI coding agent for embedded systems and firmware engineering
Embedder is a specialized AI coding agent for firmware and embedded systems development. It supports 400+ MCU variants including STM32 and ESP32, parses hardware datasheets to understand register maps and pin configurations, and verifies generated code by interacting with physical boards via serial console. Embedder took part in the YC S25 batch and is currently in beta.
Open-source LLMOps platform for prompt management and evaluation
Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ LLMs with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.
All-in-one open-source observability — logs, metrics, traces, RUM
OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and real user monitoring in a single binary. It claims 140x lower storage costs than Elasticsearch through columnar storage and compression, with native OpenTelemetry support, a built-in query UI, dashboards, and alerts. It is designed for AI and cloud-native workloads at petabyte scale and has over 15,000 GitHub stars.
Slack-native incident management with AI SRE agent
Incident.io is a Slack-native incident management platform with an AI SRE that autonomously investigates alerts, correlates deployments with telemetry, and drafts fix pull requests. Its customers include Buffer (70% fewer critical incidents), Favor (37% MTTR reduction), Intercom, and Productboard. Features include automated workflows, on-call scheduling, post-incident learning, and status pages. It integrates with PagerDuty, Datadog, GitHub, Jira, and 100+ tools.
Observability platform purpose-built for Python and Pydantic AI apps
Pydantic Logfire is an observability platform built by the Pydantic team specifically for Python AI applications. It provides structured logging, distributed tracing, and metrics with native understanding of Pydantic models, FastAPI, and AI framework data types. It auto-instruments OpenAI, Anthropic, LangChain, and other LLM providers, and it is built on OpenTelemetry for vendor-neutral data export. The managed cloud dashboard has a generous free tier for development and small-scale production use.
Non-agent approach to automated software engineering via localize-and-repair
Agentless takes a deliberate non-agent approach to LLM-powered software engineering. Instead of autonomous agents making tool calls, it uses a structured localize-then-repair pipeline: it first narrows down which files and functions are relevant, then generates targeted patches. It achieved competitive SWE-Bench results at an average cost of $0.34 per issue and was adopted by OpenAI for o3 evaluations. The project has 3,000+ GitHub stars and is MIT licensed. It serves as a counterpoint to the agent-heavy trend in AI coding tools.
OpenTelemetry-based observability SDK for LLM applications
Traceloop's OpenLLMetry is an open-source observability SDK that instruments LLM applications using the OpenTelemetry standard. It auto-traces calls to OpenAI, Anthropic, Cohere, Pinecone, ChromaDB, LangChain, and other AI frameworks, sending data to any OTEL-compatible backend like Datadog, Grafana, Jaeger, or Honeycomb. Backed by Battery Ventures with 2,000+ GitHub stars. Ideal for teams already using OpenTelemetry who want LLM observability without vendor lock-in.
Pipe terminal output to LLMs with beautiful markdown rendering
Mods by Charmbracelet is an AI-powered CLI tool that lets you pipe any shell output directly to LLMs and renders the response as markdown in the terminal. It follows the Unix philosophy: it composes with existing tools via stdin/stdout pipes. It supports OpenAI, Anthropic, Ollama, and other providers under a bring-your-own-key model. The project has 9,800+ GitHub stars and is MIT licensed. It is part of the Charm ecosystem, known for terminal UI libraries and tools including Bubble Tea, Lip Gloss, and Glow.
Observability data accessible to AI agents via MCP
Netdata's MCP integration exposes infrastructure monitoring, discovery, and root-cause analysis capabilities to AI agents. Built into the 78K+ star Netdata monitoring platform, it lets agents query real-time metrics, explore system health, investigate incidents, and generate observability reports through the Model Context Protocol.
AI-driven log analysis with zero false positives
Dash0 is an AI-driven observability platform focused on log analysis. It auto-structures unstructured logs, provides instant alerting that it describes as having zero false positives, and delivers full-stack tracing. It uses AI to transform raw log data into structured, searchable events without requiring manual parsing configuration, which speeds up log-based debugging for engineering teams.
Kubernetes troubleshooting with event context
Komodor is a Kubernetes troubleshooting platform that extracts event and change context from clusters, correlating deployments, config changes, and infrastructure events to quickly identify the root cause of pod failures. Its Slack integration delivers incident context directly into team channels, helping SRE and platform teams reduce mean time to resolution by connecting the dots between what changed and what broke.
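The idea of connecting "what changed" to "what broke" can be sketched as a time-window lookup: given a failure timestamp, list the changes that landed shortly before it, most recent first. This is a simplified illustration, not Komodor's method or API. The `recent_changes` function, the 30-minute default window, and the `time` field on each change record are assumptions.

```python
from datetime import datetime, timedelta


def recent_changes(failure_time, changes, window=timedelta(minutes=30)):
    """Return changes that happened within `window` before the failure,
    ordered from most recent to oldest (hypothetical record shape)."""
    candidates = [
        change
        for change in changes
        if timedelta(0) <= failure_time - change["time"] <= window
    ]
    return sorted(candidates, key=lambda change: failure_time - change["time"])
```

A real system would also weigh which resources a change touched, but recency alone is often enough to put the likely culprit at the top of the list.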