LLM observability, AI application monitoring, data drift detection, model performance tracking, and intelligent debugging tools for AI-powered applications.
Open-source AIOps alert management platform
Keep is an open-source AIOps platform that provides a single pane of glass for all alerts from monitoring tools like Datadog, PagerDuty, Grafana, and 50+ integrations. It uses AI to correlate, deduplicate, and enrich alerts, reducing noise and helping on-call teams focus on real incidents. Keep includes workflow automation, bidirectional sync with ticketing systems, and a modern web dashboard.
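As a rough illustration of what alert deduplication involves, the sketch below collapses alerts that share a fingerprint. It is not Keep's implementation: the field names (`source`, `service`, `name`) and both helper functions are hypothetical, chosen only to show the idea.

```python
import hashlib
from collections import OrderedDict

def fingerprint(alert, keys=("source", "service", "name")):
    """Hash the fields that identify 'the same' alert (hypothetical field names)."""
    raw = "|".join(str(alert.get(k, "")) for k in keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def deduplicate(alerts):
    """Collapse alerts with equal fingerprints, keeping the first and counting repeats."""
    grouped = OrderedDict()
    for alert in alerts:
        fp = fingerprint(alert)
        if fp not in grouped:
            grouped[fp] = {**alert, "count": 0}
        grouped[fp]["count"] += 1
    return list(grouped.values())

alerts = [
    {"source": "datadog", "service": "checkout", "name": "HighLatency"},
    {"source": "datadog", "service": "checkout", "name": "HighLatency"},
    {"source": "grafana", "service": "search", "name": "DiskFull"},
]
unique = deduplicate(alerts)
for alert in unique:
    print(alert["name"], alert["count"])
```

Which fields go into the fingerprint decides how aggressive the deduplication is: fewer fields merge more alerts, at the risk of hiding distinct incidents.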
Full-stack monitoring with session replay and tracing
Highlight.io is an open-source full-stack monitoring platform that combines session replay, error monitoring, logging, and distributed tracing in a single tool. It captures user sessions as replayable videos alongside frontend errors, backend logs, and traces, letting teams see exactly what users experienced when issues occur. It is self-hostable via Docker, and its managed cloud service includes a free tier.
CNCF vendor-neutral standard for distributed traces, metrics, and logs
OpenTelemetry is the CNCF standard for generating, collecting, and exporting telemetry data, including distributed traces, metrics, and logs. It provides instrumentation SDKs for every major programming language, the OpenTelemetry Collector for data processing and routing, and semantic conventions that ensure consistent telemetry across services. It is the second most active CNCF project after Kubernetes.
ClickHouse-powered APM with 100% trace sampling and affordable retention
Uptrace is an OpenTelemetry-native APM platform that stores traces, metrics, and logs in ClickHouse for high-performance querying with 100% trace sampling. Unlike platforms that sample traces to control costs, Uptrace retains every trace for complete visibility into system behavior. It can be self-hosted under the BSL license or used as a managed cloud service with affordable per-event pricing.
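To make the sampling trade-off concrete, here is a minimal sketch of ratio-based head sampling, the kind of scheme that 100% sampling avoids. It is a generic illustration, not Uptrace's code: the function name `keep_trace` and the choice of the low 64 bits of the trace ID are assumptions.

```python
import random

def keep_trace(trace_id_hex, ratio):
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value.
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * (1 << 64)

# Generate 1000 synthetic 128-bit trace IDs as 32-character hex strings.
rng = random.Random(0)
trace_ids = [format(rng.getrandbits(128), "032x") for _ in range(1000)]

kept_all = sum(keep_trace(t, 1.0) for t in trace_ids)    # ratio 1.0 keeps every trace
kept_tenth = sum(keep_trace(t, 0.1) for t in trace_ids)  # ratio 0.1 keeps roughly a tenth
print(kept_all, kept_tenth)
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same choice. The cost is that a rare failure has a good chance of landing in the discarded fraction.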
eBPF auto-instrumentation for OpenTelemetry without code changes
Odigos provides automatic OpenTelemetry instrumentation for Kubernetes applications using eBPF, requiring zero code changes or SDK integration. It detects running applications, identifies their programming language, and attaches appropriate instrumentation to generate distributed traces, metrics, and logs. It is a CNCF project that routes telemetry to any OpenTelemetry-compatible backend, including Jaeger, Grafana, and Datadog.
Open-source replacement for Datadog, PagerDuty, and StatusPage combined
OneUptime is an MIT-licensed observability platform that combines infrastructure monitoring, incident management, status pages, and APM in a single self-hosted solution. It replaces the need for separate Datadog, PagerDuty, and Atlassian StatusPage subscriptions. Features include OpenTelemetry-native data ingestion, on-call scheduling, automated incident workflows, and public status page hosting.
Open-source AI observability platform for LLM tracing and evaluation
Phoenix by Arize is an open-source AI observability platform for tracing, evaluating, and debugging LLM applications. It captures prompt-response pairs, retrieval context, agent tool calls, and latency data through OpenTelemetry-based instrumentation. It also provides experiment tracking, dataset management, and evaluation frameworks for systematically improving AI application quality. The project has over 9,200 GitHub stars.
High-cardinality observability platform for debugging production systems
Honeycomb is an observability platform built for high-cardinality data analysis that enables debugging of complex distributed systems. It provides interactive query exploration, distributed tracing visualization, and SLO monitoring without requiring pre-defined dashboards or metrics aggregation. The company has raised over $115M to build query-driven observability in which any dimension can be used to slice production data.
CNCF Sandbox Kubernetes alert enrichment and automation platform
Robusta is a CNCF Sandbox project that enriches Kubernetes alerts with diagnostic context and automates remediation workflows. It intercepts Prometheus alerts and attaches relevant logs, pod status, resource metrics, and troubleshooting suggestions before delivering them to Slack, Teams, or PagerDuty. It supports custom playbooks for automated incident response and AI-powered root cause analysis.
AI-powered SRE agent for Kubernetes troubleshooting
Metoro is an AI SRE platform for Kubernetes that combines observability with autonomous troubleshooting. Its Guardian agent monitors cluster health, correlates metrics, logs, and traces to identify root causes, and suggests remediation actions. Features an MCP server for integration with AI coding agents and natural language querying of infrastructure state.
Zero-instrumentation Kubernetes observability powered by eBPF
Coroot is an open-source observability platform that uses eBPF to automatically instrument Kubernetes applications without code changes. It provides application maps, latency analysis, log correlation, and continuous profiling with automatic anomaly detection. Its agents capture metrics, traces, and logs at the kernel level, replacing the need for manual instrumentation.
Developer-focused APM for Ruby, Elixir, Node.js, and Python
AppSignal is an application performance monitoring platform for Ruby, Elixir, Node.js, Python, and frontend JavaScript. It combines error tracking, performance monitoring, host metrics, anomaly detection, and uptime checks in a single dashboard with per-request pricing. The Netherlands-based company serves over 10,000 developers with an interface designed for clarity over enterprise complexity.
OpenTelemetry-native observability for LLM applications with evals and GPU monitoring
OpenLIT is an open-source AI engineering platform that provides OpenTelemetry-native observability for LLM applications. It combines distributed tracing, evaluation, prompt management, a secrets vault, and GPU telemetry in a single self-hostable stack. With 50+ integrations across LLM providers and frameworks, it lets teams monitor AI applications using their existing observability backends like Grafana, Datadog, or Jaeger.
Open-source observability platform unifying logs, traces, and session replays
HyperDX is an open-source observability platform that correlates session replays, logs, metrics, traces, and errors in a single interface powered by ClickHouse and OpenTelemetry. Acquired by ClickHouse in 2025, it now forms the visualization layer of ClickStack. It offers schema-agnostic querying on any ClickHouse cluster, full-text and property search syntax, and fast analytics. It is available as a self-hosted Docker deployment or a managed cloud service with a free tier.
Open-source full-stack observability with metrics, logs, and traces
SigNoz is an open-source observability platform, built natively on OpenTelemetry, that unifies metrics, logs, and traces in a single interface. With over 26,000 GitHub stars, it provides a self-hosted alternative to Datadog and New Relic with no per-host pricing, columnar storage via ClickHouse for fast queries, and dashboards, alerts, and service maps out of the box.
Lightweight server monitoring with Docker stats and alerts
Beszel is a lightweight, self-hosted server monitoring platform built in Go that tracks CPU, memory, disk, network, GPU, temperature, and Docker container metrics with historical data visualization and configurable alerts. Its simple hub-and-agent architecture deploys in minutes and consumes minimal resources compared to traditional monitoring stacks like Prometheus and Grafana.
Open-source LLMOps platform for prompt management and evaluation
Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ models with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.
All-in-one open-source observability: logs, metrics, traces, RUM
OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and real user monitoring in a single binary. It claims 140x lower storage costs than Elasticsearch through columnar storage and compression, with native OpenTelemetry support, a built-in query UI, dashboards, and alerts. It is designed for AI and cloud-native workloads at petabyte scale and has over 15,000 GitHub stars.
Open-source LLM gateway with built-in optimization and A/B testing
TensorZero is an open-source LLMOps platform written in Rust that unifies an LLM gateway, observability, prompt optimization, and A/B experimentation in a single binary. It routes requests across providers with sub-millisecond P99 latency at 10K+ QPS while capturing structured data for continuous improvement. It supports dynamic in-context learning, fine-tuning workflows, and production feedback loops. The project is backed by $7.3M in seed funding and has 11K+ GitHub stars.
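A gateway that runs A/B experiments needs some rule for assigning each request to a variant. The sketch below shows one common approach, deterministic weighted assignment by hashing an ID. It is an illustration only, not TensorZero's API: `choose_variant`, the variant names, and the weights are all hypothetical.

```python
import hashlib
from collections import Counter

def choose_variant(unit_id, weights):
    """Deterministically map unit_id to a variant in proportion to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(unit_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a point in [0, total).
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge

# Hypothetical experiment: 80% of users see prompt_a, 20% see prompt_b.
weights = {"prompt_a": 0.8, "prompt_b": 0.2}
counts = Counter(choose_variant(f"user-{i}", weights) for i in range(2000))
print(counts)
```

Hashing the ID rather than drawing a random number per request means the same user always lands on the same variant, which keeps feedback attributable to one arm of the experiment.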
Industry-standard incident management and on-call alerting platform
PagerDuty is the dominant incident management platform, providing on-call scheduling, alert routing, escalation policies, and incident response orchestration. It integrates with 650+ monitoring, ticketing, and chat tools, including Datadog, Slack, Jira, and AWS, and features AIOps for noise reduction and automated diagnostics. It is used by thousands of engineering teams globally for 24/7 operations. Pricing: free tier for up to 5 users; Professional from $21/user/mo; Business at $41/user/mo.
Slack-native incident management with AI SRE agent
Incident.io is a Slack-native incident management platform with an AI SRE that autonomously investigates alerts, correlates deployments with telemetry, and drafts fix pull requests. Used by Buffer (70% fewer critical incidents), Favor (37% MTTR reduction), Intercom, and Productboard. Features include automated workflows, on-call scheduling, post-incident learning, and status pages. Integrates with PagerDuty, Datadog, GitHub, Jira, and 100+ tools.
LLM evaluation and tracking with RAG triad metrics
TruLens is an open-source framework for evaluating and tracking LLM experiments with feedback functions, RAG triad metrics (answer relevance, context relevance, groundedness), and Honest/Harmless/Helpful evaluations. It features a unified Metric API for systematic evaluation of RAG pipelines and AI agents. The project is MIT licensed and has 3,200+ GitHub stars, and a Snowflake partnership adds enterprise integration. It supports LangChain, LlamaIndex, and custom LLM applications.
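The RAG triad scores three pairings: question against context (context relevance), answer against context (groundedness), and question against answer (answer relevance). The sketch below shows that structure only. It does not use the TruLens API, and the `overlap` scorer is a toy lexical stand-in (an assumption for illustration) for the model-based feedback functions a real evaluation would plug in.

```python
import re

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    """Toy scorer: fraction of the tokens in `a` that also appear in `b`."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0

def rag_triad(question, context, answer, score=overlap):
    """Score the three pairings of the RAG triad with a pluggable scorer."""
    return {
        "context_relevance": score(question, context),
        "groundedness": score(answer, context),
        "answer_relevance": score(question, answer),
    }

question = "What is the capital of France?"
context = "Paris is the capital of France."
grounded = rag_triad(question, context, "The capital of France is Paris.")
ungrounded = rag_triad(question, context, "Berlin")
print(grounded)
print(ungrounded)
```

Keeping the three scores separate is the point of the triad: a low groundedness score with high context relevance suggests the model ignored good retrieval, while the reverse points at the retriever.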
Observability platform purpose-built for Python and Pydantic AI apps
Pydantic Logfire is an observability platform built by the Pydantic team specifically for Python AI applications. It provides structured logging, distributed tracing, and metrics with native understanding of Pydantic models, FastAPI, and AI framework data types. It auto-instruments OpenAI, Anthropic, LangChain, and other LLM providers and frameworks, and it is built on OpenTelemetry for vendor-neutral data export. The managed cloud dashboard has a free tier for development and small-scale production use.
OpenTelemetry-based observability SDK for LLM applications
Traceloop's OpenLLMetry is an open-source observability SDK that instruments LLM applications using the OpenTelemetry standard. It auto-traces calls to OpenAI, Anthropic, Cohere, Pinecone, ChromaDB, LangChain, and other AI providers and frameworks, sending data to any OTEL-compatible backend such as Datadog, Grafana, Jaeger, or Honeycomb. The project is backed by Battery Ventures and has 2,000+ GitHub stars. It suits teams already using OpenTelemetry who want LLM observability without vendor lock-in.