# observability
67 tools tagged
Showing 24 of 67 tools
Latitude
Sentry-style observability for AI agent conversations
Latitude is an agent observability platform for teams that need to inspect LLM traces, conversations, issues, and evaluation feedback in one workflow. Its public repo and docs position it as a Sentry-style monitor for AI agents, with semantic search, issue detection, annotations, MCP-assisted fixes, and cloud or self-hosted deployment paths for production debugging.
Spotlight by Backplanes
Session reports for Claude Code and Codex runs
Spotlight by Backplanes turns completed Claude Code and Codex sessions into concise reports for engineering, security, and spend review. The CLI installs on macOS, Linux, or WSL 2, watches sessions after they finish, redacts PII and credentials locally before upload, then summarizes files touched, commands run, external domains reached, scope drift, risky actions, and next-session improvements.
agentmemory
Persistent memory layer for AI coding agents — keeps Claude Code, Codex, Cursor, and any MCP agent in context across sessions
agentmemory is an open-source MCP server that gives AI coding agents persistent, cross-session memory. Built on hybrid vector-graph search, it achieves 95.2% recall on the LongMemEval-S benchmark while using up to 92% fewer context tokens than naive context injection. It works out of the box with Claude Code, Codex, Cursor, Windsurf, Cline, OpenCode, Kilo Code, Hermes, and any MCP client through 51 MCP tools plus 12 hooks and 4 skills.
Traceway
OpenTelemetry-native observability with AI tracing, logs, traces, metrics, and session replay — self-hosted in 90 seconds.
Traceway is an open-source, OpenTelemetry-native observability platform that combines logs, traces, metrics, exceptions, session replay, and AI tracing in a single self-hosted system. MIT licensed with no open-core restrictions, it deploys in 90 seconds via Docker Compose and accepts OTLP/HTTP from any OTel SDK without a Collector or per-language vendor SDK.
Judgeval
Open-source post-building layer for agents — tracing, evals, and online monitoring
Judgeval is the open-source post-building layer for AI agents from Judgment Labs, providing OpenTelemetry-based tracing, hosted and custom evaluation scorers, and online behavior monitoring for LLM-powered applications. Instrument any function with a single decorator, score live production traffic against faithfulness and instruction-adherence checks, and feed real-world failures back into reinforcement learning or supervised fine-tuning loops.
TraceRoot
Open-source observability and self-healing layer for AI agents
TraceRoot is a YC S25-backed open-source observability platform purpose-built for AI agents and LLM apps. It combines OpenTelemetry-compatible tracing with an agentic debugging runtime that reads your source code, correlates failures with recent commits, and proposes fix PRs automatically. BYOK support spans seven LLM providers; the entire stack runs self-hosted via Docker Compose, with TraceRoot Cloud available for managed deployments.
Evolver
Self-evolution engine for AI agents with auditable updates
Evolver is an open-source self-evolution engine for AI agents that turns run logs into auditable, reviewable updates via its Genome Evolution Protocol. Instead of ad hoc prompt tweaking, teams collect traces and Evolver proposes versioned diffs to prompts, tools and workflows that engineers can approve, reject or roll back like code.
CodeBurn
See where your AI coding tokens actually go
CodeBurn is an open-source TUI dashboard and CLI that shows where your AI coding tokens actually go, broken down by task type, tool, model, MCP server, and project. It reads local session data directly from Claude Code, Codex, Cursor, OpenCode, Pi, and GitHub Copilot (no wrapper, proxy, or API keys) and layers on one-shot success rates so you can see whether the AI nails work on the first try or burns budget on edit/test/fix retries. It ships with a macOS menu bar widget and CSV/JSON export.
Weights & Biases
ML experiment tracking and model monitoring
Weights & Biases is an AI developer platform for experiment tracking, artifact and model lineage, model monitoring, and Weave-based LLM evaluation. It helps teams log runs, compare metrics, manage datasets and model artifacts, and collaborate through dashboards, reports, alerts, SSO/RBAC controls, and hosted or self-managed deployment options.
Resolve AI
AI-powered production incident resolution
Resolve AI automates production incident investigation, diagnosis, and remediation, acting as an AI SRE that participates in every on-call rotation. It autonomously investigates incidents by pursuing multiple hypotheses in parallel, validates them against real evidence, creates code snippets and drafts PRs, generates post-mortems, and onboards new teammates with instant answers about code and infrastructure. The vendor claims 5x faster MTTR and 87% faster incident investigations.
RagaAI Catalyst
AI testing and evaluation for agents and LLM apps
RagaAI Catalyst is a Python SDK for observability, monitoring, and evaluation of LLM and agentic applications. It provides agent tracing with execution graph visualization, a self-hosted dashboard with analytics, synthetic data generation, a multi-metric evaluation framework, and guardrail management. It is built for teams running production RAG systems and AI agents who need systematic testing, debugging, and performance optimization workflows.
Laminar
Open-source observability for AI agents
Laminar is an open-source observability platform for AI agents providing tracing, evaluation, and analytics for LLM applications. It integrates with Vercel AI SDK, LangChain, OpenAI, and Anthropic with a single line of code. Features include OpenTelemetry-native SDKs, an extensible evaluation framework with CI/CD support, SQL access to traces and metrics, and a visual debugging timeline for agent reasoning and actions.
PostHog
Open-source product analytics, session replay, and feature flags
PostHog is an open-source product and data tools platform for analytics, session replay, feature flags, experiments, surveys, error tracking, web analytics, data warehouse, CDP and LLM observability workflows. It suits developer-led teams that want one integrated product OS instead of many separate tools.
ElectricSQL
Postgres sync engine for local-first and real-time applications
ElectricSQL is a sync engine that keeps local application state synchronized with PostgreSQL in real time. It enables local-first architectures where apps work offline with instant responsiveness, syncing data bidirectionally when connectivity is available. It supports partial replication with shape-based subscriptions to sync only relevant data subsets to each client.
Gel
Graph-relational database with EdgeQL query language, formerly EdgeDB
Gel (formerly EdgeDB) is a graph-relational database that combines the relational model with graph database traversal capabilities through its EdgeQL query language. Built on PostgreSQL, it eliminates the object-relational impedance mismatch with a type system that maps directly to application data models. It features built-in migrations, authentication, and an interactive web UI.
qodo-cover
AI-powered test generation agent for automated code coverage improvement
qodo-cover (formerly Cover Agent) is an open-source AI agent that automatically generates meaningful unit tests to improve code coverage. It analyzes existing code and test patterns to produce tests that follow project conventions and target uncovered branches. It works iteratively: each generated test is verified by running it, and tests that fail are discarded. It is MIT licensed and has over 5,300 GitHub stars.
Krkn
CNCF Sandbox chaos engineering framework for Kubernetes resilience
Krkn is a CNCF Sandbox chaos engineering tool that tests Kubernetes cluster resilience by injecting controlled failures. It simulates pod kills, node failures, network partitions, CPU/memory pressure, and zone outages. Krkn-AI adds AI-powered scenario generation that suggests chaos experiments based on cluster topology. Krkn supports CI/CD integration for automated resilience testing in deployment pipelines.
Robusta
CNCF Sandbox Kubernetes alert enrichment and automation platform
Robusta is a CNCF Sandbox project that enriches Kubernetes alerts with diagnostic context and automates remediation workflows. It intercepts Prometheus alerts and attaches relevant logs, pod status, resource metrics, and troubleshooting suggestions before delivering them to Slack, Teams, or PagerDuty. It supports custom playbooks for automated incident response and AI-powered root cause analysis.
Metoro
AI-powered SRE agent for Kubernetes troubleshooting
Metoro is an AI SRE platform for Kubernetes that combines observability with autonomous troubleshooting. Its Guardian agent monitors cluster health, correlates metrics, logs, and traces to identify root causes, and suggests remediation actions. It features an MCP server for integration with AI coding agents and natural language querying of infrastructure state.
Checkpoints by Entire
Git-native AI agent session capture and reasoning traceability
Checkpoints by Entire captures the full reasoning context behind AI-generated code directly in Git. Entire records transcripts, prompts, files touched, token usage, and tool calls alongside every commit. Session metadata lives on a separate branch, which keeps your history clean, and rewind capabilities let you restore any previous agent checkpoint when things go sideways.
OpenLIT
OpenTelemetry-native observability for LLM applications with evals and GPU monitoring
OpenLIT is an open-source AI engineering platform that provides OpenTelemetry-native observability for LLM applications. It combines distributed tracing, evaluation, prompt management, a secrets vault, and GPU telemetry in a single self-hostable stack. With 50+ integrations across LLM providers and frameworks, it lets teams monitor AI applications using their existing observability backends like Grafana, Datadog, or Jaeger.
Hugging Face Skills
ACP skill definitions giving coding agents HuggingFace ML superpowers
Hugging Face Skills is the official collection of ACP skill definitions that give AI coding agents access to HuggingFace ML capabilities. The 13 skills cover LLM fine-tuning with TRL, vision model training, dataset management, model evaluation, and cloud job submission on HF infrastructure. Compatible with Claude Code, Codex, Gemini CLI, and Cursor via a single npx command.
Agenta
Open-source LLMOps platform for prompt management and evaluation
Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ LLMs with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.
OpenObserve
All-in-one open-source observability — logs, metrics, traces, RUM
OpenObserve is an open-source observability platform that unifies logs, metrics, traces, and real user monitoring in a single binary. It claims 140x lower storage costs than Elasticsearch through columnar storage and compression, with native OpenTelemetry support, a built-in query UI, dashboards, and alerts. It is designed for AI and cloud-native workloads at petabyte scale and has over 15,000 GitHub stars.