
# monitoring

24 tools tagged

Latitude

Sentry-style observability for AI agent conversations

Latitude is an agent observability platform for teams that need to inspect LLM traces, conversations, issues, and evaluation feedback in one workflow. Its public repo and docs position it as a Sentry-style monitor for AI agents, with semantic search, issue detection, annotations, MCP-assisted fixes, and cloud or self-hosted deployment paths for production debugging.

freemium · Open Source · Telemetry

Spotlight by Backplanes

Session reports for Claude Code and Codex runs

Spotlight by Backplanes turns completed Claude Code and Codex sessions into concise reports for engineering, security, and spend review. The CLI installs on macOS, Linux, or WSL 2, watches sessions after they finish, redacts PII and credentials locally before upload, then summarizes files touched, commands run, external domains reached, scope drift, risky actions, and next-session improvements.

freemium · Telemetry

Judgeval

Open-source post-building layer for agents — tracing, evals, and online monitoring

Judgeval is the open-source post-building layer for AI agents from Judgment Labs, providing OpenTelemetry-based tracing, hosted and custom evaluation scorers, and online behavior monitoring for LLM-powered applications. Instrument any function with a single decorator, score live production traffic against faithfulness and instruction-adherence checks, and feed real-world failures back into reinforcement learning or supervised fine-tuning loops.

open-source · Open Source
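
As a rough illustration of what decorator-based instrumentation looks like in general, the sketch below wraps a function and records one span per call with its name, duration, and outcome. It is not Judgeval's API: `trace`, `SPANS`, and `answer` are hypothetical names, and a real tracer would presumably export spans through OpenTelemetry rather than append them to a list.

```python
import functools
import time

# Hypothetical in-memory span store; a real tracer would export spans instead.
SPANS = []


def trace(func):
    """Record one span (name, duration, status) per call of the wrapped function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            # Runs on both success and failure, so every call yields a span.
            SPANS.append(
                {
                    "name": func.__name__,
                    "duration_s": time.perf_counter() - start,
                    "status": status,
                }
            )

    return wrapper


@trace
def answer(question):
    # Stand-in for an LLM call (hypothetical).
    return f"echo: {question}"
```

Calling `answer("hi")` returns the function's normal result and leaves one entry in `SPANS`, which is the property that makes a single decorator enough to instrument existing code.
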

Weights & Biases

ML experiment tracking and model monitoring

Weights & Biases is an AI developer platform for experiment tracking, artifact and model lineage, model monitoring, and Weave-based LLM evaluation. It helps teams log runs, compare metrics, manage datasets and model artifacts, and collaborate through dashboards, reports, alerts, SSO/RBAC controls, and hosted or self-managed deployment options.

freemium

Resolve AI

AI-powered production incident resolution

Resolve AI automates production incident investigation, diagnosis, and remediation, acting as an AI SRE that participates in every on-call rotation. It autonomously investigates incidents by pursuing multiple hypotheses in parallel, validates them against real evidence, creates code snippets, drafts PRs, generates post-mortems, and onboards new teammates with instant answers about code and infrastructure. The vendor claims 5x faster MTTR and 87% faster incident investigations.

paid

Fig Security

Security operations resilience for SOC teams

Fig provides a Security Operations Resilience platform designed for modern SOC teams facing both unplanned and planned changes. Its features include drift detection to catch unplanned infrastructure changes, automated drift repair with testing, planned change modeling to simulate initiatives before deployment, version control, and automatic deployment with rollbacks. The vendor positions it as helping teams maintain security coverage while shipping "risk-free at 10x speed" and focusing on strategic cyber work.

paid

Laminar

Open-source observability for AI agents

Laminar is an open-source observability platform for AI agents providing tracing, evaluation, and analytics for LLM applications. It integrates with Vercel AI SDK, LangChain, OpenAI, and Anthropic with a single line of code. Features include OpenTelemetry-native SDKs, an extensible evaluation framework with CI/CD support, SQL access to traces and metrics, and a visual debugging timeline for agent reasoning and actions.

freemium · Open Source

Keep

Open-source AIOps alert management platform

Keep is an open-source AIOps platform that provides a single pane of glass for alerts from monitoring tools such as Datadog, PagerDuty, and Grafana, plus 50+ other integrations. It uses AI to correlate, deduplicate, and enrich alerts, reducing noise and helping on-call teams focus on real incidents. Keep includes workflow automation, bidirectional sync with ticketing systems, and a modern web dashboard.

open-source · Open Source
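
To make the idea of alert deduplication concrete, here is a minimal rule-based sketch that collapses alerts sharing the same identifying fields. It is a plain fingerprint approach swapped in for illustration, not Keep's AI-based correlation; the field names (`source`, `service`, `name`) and the helper names are assumptions.

```python
import hashlib


def fingerprint(alert, keys=("source", "service", "name")):
    """Hash the fields that identify 'the same' alert, ignoring timestamps."""
    raw = "|".join(str(alert.get(k, "")) for k in keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(alerts):
    """Collapse alerts with equal fingerprints, keeping the first and counting repeats."""
    merged = {}
    for alert in alerts:
        fp = fingerprint(alert)
        if fp in merged:
            merged[fp]["count"] += 1
        else:
            merged[fp] = {**alert, "count": 1}
    # Dicts preserve insertion order, so output follows first-seen order.
    return list(merged.values())


# Hypothetical sample alerts from two tools.
alerts = [
    {"source": "grafana", "service": "api", "name": "HighLatency", "ts": 1},
    {"source": "grafana", "service": "api", "name": "HighLatency", "ts": 2},
    {"source": "datadog", "service": "db", "name": "DiskFull", "ts": 3},
]
deduped = deduplicate(alerts)
```

The two `HighLatency` alerts collapse into one entry with a repeat count, while the `DiskFull` alert stays separate. Choosing which fields go into the fingerprint is the main design decision: too few and unrelated alerts merge, too many and duplicates slip through.
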

ScaleOps

Autonomous Kubernetes and GPU infrastructure optimization

ScaleOps provides autonomous real-time management of Kubernetes and GPU infrastructure, reducing cloud costs by up to 80 percent without manual configuration. Backed by $130 million in Series C funding at an $800 million valuation, it serves enterprises including Adobe, Wiz, DocuSign, and Salesforce. The platform continuously rightsizes pods, optimizes replicas, manages nodes, and allocates GPUs based on live workload demand rather than static configurations.

freemium

OpenObserve

All-in-one open-source observability — logs, metrics, traces, RUM

OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and real user monitoring in a single binary. It claims 140x lower storage costs than Elasticsearch through columnar storage and compression, with native OpenTelemetry support, a built-in query UI, dashboards, and alerts. It is designed for AI and cloud-native workloads at petabyte scale and has over 15,000 GitHub stars.

open-source · Open Source

Netdata MCP

Observability data accessible to AI agents via MCP

Netdata's MCP integration exposes infrastructure monitoring, discovery, and root-cause analysis capabilities to AI agents. Built into the 78K+ star Netdata monitoring platform, it lets agents query real-time metrics, explore system health, investigate incidents, and generate observability reports through the Model Context Protocol.

open-source · Open Source

Dash0

AI-driven log analysis with zero false positives

Dash0 is an AI-driven observability platform focused on log analysis. It auto-structures unstructured logs, provides instant alerting that the vendor says produces zero false positives, and delivers full-stack tracing. It uses AI to transform raw log data into structured, searchable events without manual parsing configuration, which is intended to speed up log-based debugging for engineering teams.

paid · Open Source

Rootly

AI-powered incident management in Slack and Teams

Rootly is an AI-native incident management platform that runs entirely within Slack and Microsoft Teams, automating incident workflows from detection through postmortem. It reduces manual incident overhead with AI-generated summaries, automated role assignments, escalation paths, and postmortem drafts. For enterprise use, it lists SOC 2 Type II, GDPR, and HIPAA compliance.

free

Confident AI

Evaluation-first LLM and agent observability

Confident AI is an evaluation-first observability platform that scores every trace and span with 50+ metrics, alerting on quality drops in LLM and agent applications. It goes beyond traditional APM by treating evaluation as core observability, providing actionable insights that help teams understand not just whether their AI applications are running but whether they are producing correct and useful outputs.

freemium

Coralogix

AI observability with security posture management

Coralogix uses AI to provide actionable insights across logs and traces, with a dedicated AI-SPM dashboard for tracking prompt injections and data leaks in AI applications. It uses a pay-per-use model with no upfront fees and integrates security posture management directly into the observability stack, which suits teams running both traditional and AI-powered production workloads.

api-usage-based

Middleware

Full-stack observability platform with OpenTelemetry-friendly telemetry, LLM observability, and AI SRE workflows.

Middleware is a full-stack observability platform for infrastructure, APM, logs, metrics, traces, RUM, synthetics, browser testing, LLM observability, and AI SRE workflows. It targets teams that want OpenTelemetry-friendly telemetry and faster incident correlation, and it offers a 14-day free trial before teams commit to a Pay As You Go or Enterprise plan.

freemium

Monte Carlo

Data and AI observability for enterprise teams

Monte Carlo is a leading data and AI observability platform that uses ML to monitor pipelines, warehouses, and lakes for quality issues. It detects freshness delays, volume anomalies, schema changes, and distribution shifts before they impact analytics. With 500+ deployments, including Nasdaq, Honeywell, and Roche, it provides automated root cause analysis, field-level lineage, and incident management. It is available on AWS and Azure Marketplace.

paid

Evidently AI

Open-source ML and LLM monitoring with 100+ metrics

Evidently AI is an open-source platform with 100+ pre-built metrics for monitoring data quality, model performance, and data drift in AI/ML pipelines. Available under Apache 2.0 with a cloud version, it helps teams detect when production data shifts away from training distributions, LLM output quality degrades, or feature pipelines introduce anomalies that silently degrade model accuracy.

open-source · Open Source
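
For readers unfamiliar with data drift detection, the sketch below computes the Population Stability Index (PSI), one common statistic for comparing a production sample against a reference (training) sample. It is an assumption chosen for illustration and not Evidently's implementation; the function name, bin count, and sample data are hypothetical.

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the range is zero

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical data: a uniform reference sample and a shifted copy of it.
reference = [i / 100 for i in range(1000)]
shifted = [x + 3 for x in reference]
```

Identical samples give a PSI of zero, and the shifted sample gives a much larger value. A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, though the right threshold depends on the feature and the use case.
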

New Relic

Full-stack observability with AI-powered monitoring

New Relic is a full-stack observability platform combining APM, infrastructure monitoring, logs, traces, browser/mobile monitoring, synthetics, and AIOps. Current public copy highlights 50+ capabilities, 100 GB/month free data ingest, one free full platform user, unlimited basic users, and 800+ pre-built integrations.

freemium

WhyLabs

Dead

Discontinued AI observability company with open-source platform handoff

WhyLabs was an AI observability platform for monitoring ML models, LLM apps, and data pipelines. WhyLabs, Inc. has discontinued operations; docs say the AI Control Center became Apache-2.0 OSS on January 23, 2025 and hosted SaaS access ended March 9, 2025. The whylogs, LangKit, and whylabs-oss repos remain public, so this page is a self-hosted OSS handoff, not an active managed SaaS recommendation.

open-source · Open Source

Prometheus

Open-source monitoring and alerting toolkit — the CNCF standard for metrics collection.

Prometheus is an open-source monitoring system and time-series database, a graduated CNCF project that has become the de facto standard for metrics collection in cloud-native environments. It features a powerful query language (PromQL), pull-based metrics collection, a multi-dimensional data model, and alerting via Alertmanager, and it underpins much of modern Kubernetes observability.

open-source · Open Source
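
Pull-based collection means the Prometheus server periodically scrapes an HTTP endpoint (conventionally `/metrics`) that returns current values in a plain-text format, with labels providing the dimensions. The sketch below renders that text format using only the standard library; the helper name and sample values are hypothetical, label-value escaping is omitted for brevity, and a real service would normally use an official client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    assumed to need no escaping (no quotes, backslashes, or newlines).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is deterministic.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with two label combinations.
body = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [
        ({"method": "get", "code": "200"}, 1027),
        ({"method": "post", "code": "500"}, 3),
    ],
)
```

Once scraped, a counter like this can be queried in PromQL, for example `sum by (code) (rate(http_requests_total[5m]))` to get the per-second request rate grouped by status code.
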

Sentry

Application monitoring and error tracking that helps developers fix issues faster.

Sentry is a leading error tracking and performance monitoring platform for developers. It captures and aggregates errors with full stack traces, breadcrumbs, and context across 100+ platforms, and is used by over 100,000 organizations. Features include session replay, performance tracing, and code-level profiling. An open-source self-hosted option is available.

freemium · Open Source

Datadog

Cloud-scale monitoring, security, and analytics platform for modern infrastructure.

Datadog is a cloud observability and security platform that unifies metrics, traces, logs, RUM, synthetics, APM, and security signals. Current pricing pages list 1,000+ integrations for Infrastructure Monitoring, with Pro from $15/host/month and Enterprise from $23/host/month when billed annually.

freemium

Grafana

Open-source observability platform for metrics, logs, and traces visualization.

Grafana is a leading open-source platform for monitoring and observability visualization. It connects to virtually any data source, including Prometheus, Elasticsearch, InfluxDB, PostgreSQL, CloudWatch, Datadog, and 150+ others, to create interactive dashboards. It is used by millions of users at companies like Bloomberg, JPMorgan, eBay, and PayPal. Grafana Cloud offers a fully managed experience with a generous free tier. Grafana is widely used across the CNCF ecosystem for metrics visualization.

open-source · Open Source