# monitoring
24 tools tagged
Latitude
Sentry-style observability for AI agent conversations
Latitude is an agent observability platform for teams that need to inspect LLM traces, conversations, issues, and evaluation feedback in one workflow. Its public repo and docs position it as a Sentry-style monitor for AI agents, with semantic search, issue detection, annotations, MCP-assisted fixes, and cloud or self-hosted deployment paths for production debugging.
Spotlight by Backplanes
Session reports for Claude Code and Codex runs
Spotlight by Backplanes turns completed Claude Code and Codex sessions into concise reports for engineering, security, and spend review. The CLI installs on macOS, Linux, or WSL 2, watches sessions after they finish, redacts PII and credentials locally before upload, then summarizes files touched, commands run, external domains reached, scope drift, risky actions, and next-session improvements.
Judgeval
Open-source post-building layer for agents — tracing, evals, and online monitoring
Judgeval is the open-source post-building layer for AI agents from Judgment Labs, providing OpenTelemetry-based tracing, hosted and custom evaluation scorers, and online behavior monitoring for LLM-powered applications. Instrument any function with a single decorator, score live production traffic against faithfulness and instruction-adherence checks, and feed real-world failures back into reinforcement learning or supervised fine-tuning loops.
Weights & Biases
ML experiment tracking and model monitoring
Weights & Biases is an AI developer platform for experiment tracking, artifact and model lineage, model monitoring, and Weave-based LLM evaluation. It helps teams log runs, compare metrics, manage datasets and model artifacts, and collaborate through dashboards, reports, alerts, SSO/RBAC controls, and hosted or self-managed deployment options.
Resolve AI
AI-powered production incident resolution
Resolve AI automates production incident investigation, diagnosis, and remediation, acting as an AI SRE that participates in every on-call rotation. It autonomously investigates incidents by pursuing multiple hypotheses in parallel, validates them against real evidence, creates code snippets and drafts PRs, generates post-mortems, and onboards new teammates with instant answers about code and infrastructure. The company claims 5x faster MTTR and 87% faster incident investigations.
Fig Security
Security operations resilience for SOC teams
Fig provides a Security Operations Resilience platform designed for modern SOC teams facing both unplanned and planned changes. It features drift detection to catch unplanned infrastructure changes, automated drift repair with testing, planned change modeling to simulate initiatives before deployment, version control, and automatic deployment with rollbacks. Fig says this helps teams maintain security coverage while shipping changes at 10x speed and focusing on strategic cyber work.
Laminar
Open-source observability for AI agents
Laminar is an open-source observability platform for AI agents providing tracing, evaluation, and analytics for LLM applications. It integrates with Vercel AI SDK, LangChain, OpenAI, and Anthropic with a single line of code. Features include OpenTelemetry-native SDKs, an extensible evaluation framework with CI/CD support, SQL access to traces and metrics, and a visual debugging timeline for agent reasoning and actions.
Keep
Open-source AIOps alert management platform
Keep is an open-source AIOps platform that provides a single pane of glass for alerts from monitoring tools such as Datadog, PagerDuty, and Grafana, with 50+ integrations in total. It uses AI to correlate, deduplicate, and enrich alerts, reducing noise and helping on-call teams focus on real incidents. Keep includes workflow automation, bidirectional sync with ticketing systems, and a modern web dashboard.
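Alert deduplication of the kind described above can be illustrated with a minimal sketch. This is not Keep's implementation or API: the `fingerprint` and `deduplicate` helpers and the choice of `source`, `service`, and `name` as identity fields are assumptions made purely for illustration.

```python
import hashlib


def fingerprint(alert, keys=("source", "service", "name")):
    """Hash the fields assumed to identify 'the same' alert (hypothetical key set)."""
    raw = "|".join(str(alert.get(k, "")) for k in keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(alerts):
    """Collapse alerts that share a fingerprint, keeping the first seen plus a count."""
    groups = {}
    for alert in alerts:
        fp = fingerprint(alert)
        if fp not in groups:
            # Copy the first occurrence and start counting from zero.
            groups[fp] = {**alert, "count": 0}
        groups[fp]["count"] += 1
    # Dicts preserve insertion order, so output follows first-seen order.
    return list(groups.values())
```

Hashing a fixed set of fields is the simplest possible grouping rule; a real correlation engine would typically also consider time windows and topology.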
ScaleOps
Autonomous Kubernetes and GPU infrastructure optimization
ScaleOps provides autonomous real-time management of Kubernetes and GPU infrastructure, reducing cloud costs by up to 80% without manual configuration. Backed by $130 million in Series C funding at an $800 million valuation, it serves enterprises including Adobe, Wiz, DocuSign, and Salesforce. The platform continuously rightsizes pods, optimizes replicas, manages nodes, and allocates GPUs based on live workload demand rather than static configurations.
OpenObserve
All-in-one open-source observability — logs, metrics, traces, RUM
OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and real user monitoring in a single binary. It claims 140x lower storage costs than Elasticsearch through columnar storage and compression, with native OpenTelemetry support, a built-in query UI, dashboards, and alerts. It is designed for AI and cloud-native workloads at petabyte scale and has over 15,000 GitHub stars.
Netdata MCP
Observability data accessible to AI agents via MCP
Netdata's MCP integration exposes infrastructure monitoring, discovery, and root-cause analysis capabilities to AI agents. Built into the 78K+ star Netdata monitoring platform, it lets agents query real-time metrics, explore system health, investigate incidents, and generate observability reports through the Model Context Protocol.
Dash0
AI-driven log analysis with zero false positives
Dash0 is an AI-driven observability platform focused on log analysis that auto-structures unstructured logs, provides instant alerting with zero false positives, and delivers full-stack tracing capabilities. It uses AI to transform raw log data into structured, searchable events without requiring manual parsing configuration, making log-based debugging significantly faster for engineering teams.
Rootly
AI-powered incident management in Slack and Teams
Rootly is an AI-native incident management platform that runs entirely within Slack and Microsoft Teams, automating incident workflows from detection through postmortem. It reduces manual incident overhead with AI-generated summaries, automated role assignments, escalation paths, and postmortem drafts. For enterprise use, it holds SOC 2 Type II, GDPR, and HIPAA compliance certifications.
Confident AI
Evaluation-first LLM and agent observability
Confident AI is an evaluation-first observability platform that scores every trace and span with 50+ metrics, alerting on quality drops in LLM and agent applications. It goes beyond traditional APM by treating evaluation as core observability, providing actionable insights that help teams understand not just whether their AI applications are running but whether they are producing correct and useful outputs.
Coralogix
AI observability with security posture management
Coralogix uses AI to provide actionable insights across logs and traces, with a dedicated AI-SPM dashboard for tracking prompt injections and data leaks in AI applications. It is priced pay-per-use with no upfront fees and integrates security posture management directly into the observability stack, which suits teams running both traditional and AI-powered production workloads.
Middleware
Full-stack observability platform with OpenTelemetry-friendly telemetry, LLM observability, and AI SRE workflows.
Middleware is a full-stack observability platform for infrastructure, APM, logs, metrics, traces, RUM, synthetics, browser testing, LLM observability, and AI SRE workflows. It targets teams that want OpenTelemetry-friendly telemetry and faster incident correlation, and it offers a 14-day free trial ahead of its Pay As You Go and Enterprise plans.
Monte Carlo
Data and AI observability for enterprise teams
Monte Carlo is a leading data and AI observability platform that uses ML to monitor pipelines, warehouses, and lakes for quality issues. It detects freshness delays, volume anomalies, schema changes, and distribution shifts before they impact analytics. With 500+ deployments, including at Nasdaq, Honeywell, and Roche, it provides automated root cause analysis, field-level lineage, and incident management. It is available on AWS and Azure Marketplace.
Evidently AI
Open-source ML and LLM monitoring with 100+ metrics
Evidently AI is an open-source platform with 100+ pre-built metrics for monitoring data quality, model performance, and data drift in AI/ML pipelines. Available under Apache 2.0 with a cloud version, it helps teams detect when production data shifts away from training distributions, LLM output quality degrades, or feature pipelines introduce anomalies that silently degrade model accuracy.
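To make "production data shifts away from training distributions" concrete, here is a minimal sketch of one common drift score, the Population Stability Index (PSI). It does not use or reproduce Evidently's API or its metric implementations; the `psi` function, the bin count, and the `eps` smoothing value are illustrative assumptions.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are taken from the reference sample; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    # Fall back to width 1.0 if the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A score near zero means the two samples fill the bins in similar proportions. A frequently quoted rule of thumb treats values above roughly 0.2 as a significant shift, but that threshold is a convention rather than a guarantee.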
New Relic
Full-stack observability with AI-powered monitoring
New Relic is a full-stack observability platform combining APM, infrastructure monitoring, logs, traces, browser/mobile monitoring, synthetics, and AIOps. Current public copy highlights 50+ capabilities, 100 GB/month free data ingest, one free full platform user, unlimited basic users, and 800+ pre-built integrations.
WhyLabs
Discontinued AI observability company with open-source platform handoff
WhyLabs was an AI observability platform for monitoring ML models, LLM apps, and data pipelines. WhyLabs, Inc. has discontinued operations; docs say the AI Control Center became Apache-2.0 OSS on January 23, 2025 and hosted SaaS access ended March 9, 2025. The whylogs, LangKit, and whylabs-oss repos remain public, so this page is a self-hosted OSS handoff, not an active managed SaaS recommendation.
Prometheus
Open-source monitoring and alerting toolkit — the CNCF standard for metrics collection.
Prometheus is the open-source monitoring system and time-series database that has become the CNCF standard for metrics collection in cloud-native environments. It features a powerful query language (PromQL), pull-based metrics collection, a multi-dimensional data model, and built-in alerting via Alertmanager. It is the foundation of modern Kubernetes observability.
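The pull model and the multi-dimensional data model are easiest to see in the text format a scrape target serves. The sketch below renders that format by hand in plain Python; it is an illustration only, not the official client library, and the `render_metric` helper and the sample metric are hypothetical.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs; each distinct label
    set is a separate time series of the same metric.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"
```

A Prometheus server would scrape text like this over HTTP on a schedule, after which a PromQL query such as `sum by (method) (rate(http_requests_total[5m]))` aggregates across the label dimensions.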
Sentry
Application monitoring and error tracking that helps developers fix issues faster.
Sentry is the leading error tracking and performance monitoring platform for developers. It captures and aggregates errors with full stack traces, breadcrumbs, and context across 100+ platforms, and is used by over 100,000 organizations. Features include session replay, performance tracing, and code-level profiling. An open-source self-hosted option is available.
Datadog
Cloud-scale monitoring, security, and analytics platform for modern infrastructure.
Datadog is a cloud observability and security platform that unifies metrics, traces, logs, RUM, synthetics, APM, and security signals. Current pricing pages list 1,000+ integrations for Infrastructure Monitoring, with Pro from $15/host/month and Enterprise from $23/host/month when billed annually.
Grafana
Open-source observability platform for metrics, logs, and traces visualization.
Grafana is the leading open-source platform for monitoring and observability visualization. It connects to virtually any data source (Prometheus, Elasticsearch, InfluxDB, PostgreSQL, CloudWatch, Datadog, and 150+ others) to create interactive dashboards. It is used by millions of users at companies like Bloomberg, JPMorgan, eBay, and PayPal. Grafana Cloud offers a fully managed experience with a generous free tier. Grafana is widely treated as the standard for metrics visualization in the CNCF ecosystem.