Judgeval

Open-source post-building layer for agents: tracing, evals, and online monitoring

Open Source

Judgeval is the open-source post-building layer for AI agents from Judgment Labs, providing OpenTelemetry-based tracing, hosted and custom evaluation scorers, and online behavior monitoring for LLM-powered applications. Instrument any function with a single decorator, score live production traffic against faithfulness and instruction-adherence checks, and feed real-world failures back into reinforcement learning or supervised fine-tuning loops.

Judgment Labs built Judgeval to address the last-mile reliability problem that teams hit once their agents are running in production. The Python SDK wraps any function with the @Tracer.observe() decorator and emits OpenTelemetry-compatible traces, so it slots into existing observability stacks without forcing teams onto a proprietary backend. Around that tracing core, the project layers a hosted evaluation engine with built-in scorers for faithfulness, answer relevancy, instruction adherence, and tool selection. Teams can also register custom Judge classes that return binary, numeric, or categorical responses tuned to their own quality bar.
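To make the decorator-plus-judge pattern concrete, here is a minimal standard-library sketch. It is not Judgeval's API: the names observe, Span, SPANS, and ContainsCitationJudge are hypothetical stand-ins, and spans are kept in an in-memory list where a real setup would export them to an OpenTelemetry-compatible backend.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    """Hypothetical record of one traced call (not an OpenTelemetry span)."""
    name: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


# Hypothetical in-memory sink; a real SDK would export spans instead.
SPANS: list = []


def observe(func):
    """Record one Span per call: arguments, result or error, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = Span(name=func.__name__, inputs={"args": args, "kwargs": kwargs})
        start = time.perf_counter()
        try:
            span.output = func(*args, **kwargs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)
            raise
        finally:
            # Runs on success and failure, so every call leaves a span.
            span.duration_s = time.perf_counter() - start
            SPANS.append(span)
    return wrapper


class ContainsCitationJudge:
    """Hypothetical binary judge: passes if the output has a bracketed citation."""

    def score(self, span: Span) -> bool:
        out = span.output
        return isinstance(out, str) and "[" in out and "]" in out


@observe
def answer(question: str) -> str:
    # Stand-in for an agent step; a real function would call a model.
    return "Paris is the capital of France [wikipedia]"
```

Calling answer("...") appends one Span to SPANS, and ContainsCitationJudge().score(SPANS[-1]) then yields the binary verdict. A numeric or categorical judge would follow the same shape with a different return type.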

What sets Judgeval apart from generic LLM observability tools is its post-training orientation. Captured production behavior is not only inspected after the fact: it can be replayed as evaluation datasets, exported as labeled traces for supervised fine-tuning, or used to reward and penalize trajectories during reinforcement learning runs (including GRPO-style pipelines). Judgeval therefore treats agent monitoring and agent training as a single closed loop, something few open-source projects do, with the same primitives surfacing in tracing dashboards and post-training scripts. Native integrations with LangChain, LangGraph, and LlamaIndex mean most agent stacks built on those frameworks can plug in without rewriting orchestration code.
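The closed loop described above can be sketched in a few lines of standard-library Python. Everything here is an illustrative assumption rather than Judgeval's export format: ScoredTrace, the 0.8 threshold, and the prompt/completion field names are hypothetical. The advantage function uses the group-relative normalization commonly associated with GRPO (subtract the group mean, divide by the group standard deviation), with the population standard deviation chosen for simplicity.

```python
import json
import statistics
from dataclasses import dataclass


@dataclass
class ScoredTrace:
    """Hypothetical production trace with an evaluation score in [0, 1]."""
    prompt: str
    response: str
    score: float


def to_sft_examples(traces, threshold=0.8):
    """Keep only high-scoring traces as prompt/completion pairs for SFT."""
    return [
        {"prompt": t.prompt, "completion": t.response}
        for t in traces
        if t.score >= threshold
    ]


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(e) for e in examples)


def group_relative_advantages(rewards):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored the same, so none is preferred.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

In this sketch, the same scores serve both purposes: thresholding selects fine-tuning examples, while normalization turns them into per-trajectory training signals for a reinforcement learning run.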

The licensing model is straightforward: the SDK and core platform are Apache 2.0, self-hostable, and the GitHub repository carries a public scorer library that teams can extend. Judgment Labs offers a managed cloud for teams that prefer not to run their own ingestion infrastructure, but the open-source path is fully featured rather than a stripped-down teaser. With over a thousand GitHub stars, daily commits, and active integrations across the agent framework ecosystem, Judgeval is a strong fit for teams that want Sentry-style production monitoring for their agents without surrendering ownership of their evaluation data or training pipelines.

Pricing

Open source (Apache 2.0) / Judgment Labs managed cloud (usage-based)

Platforms

Self-hosted (Python SDK, OpenTelemetry) / Managed cloud / LangChain, LangGraph, LlamaIndex integrations

Alternatives

TraceRoot

Open-source observability and self-healing layer for AI agents

TraceRoot is a YC S25-backed open-source observability platform purpose-built for AI agents and LLM apps. It combines OpenTelemetry-compatible tracing with an agentic debugging runtime that reads your source code, correlates failures with recent commits, and proposes fix PRs automatically. BYOK support spans seven LLM providers; the entire stack runs self-hosted via Docker Compose, with TraceRoot Cloud available for managed deployments.

Open Source

LangSmith

LLM application observability and evaluation platform

LangSmith is LangChain's platform for debugging, testing, evaluating, and monitoring LLM applications in production. It provides detailed tracing of every step in LLM chains and agent workflows, dataset management for regression testing, prompt versioning, and automated evaluation with custom metrics. It also features an annotation queue for human feedback, online monitoring dashboards, and integration with LangChain, LangGraph, and other LLM frameworks via the Python/JS SDK. It is aimed at production LLM ops.

freemium

Langfuse

Open-source LLM engineering platform for observability

Langfuse is an open-source LLM engineering platform with 29K+ GitHub stars for tracing, evaluating, and monitoring AI applications. Acquired by ClickHouse, it provides detailed traces of LLM calls, prompt management with versioning, dataset-based evaluation, user feedback collection, and cost tracking. It is framework-agnostic, with native integrations for LangChain, LlamaIndex, the OpenAI SDK, and the Vercel AI SDK, and it offers both self-hosted deployment and a managed cloud service.

Open Source

Related Tools

eve by Vercel

Filesystem-first framework for durable AI agents

Eve is Vercel's filesystem-first TypeScript framework for building durable AI agents as ordinary project files. It combines Markdown instructions and skills, typed tools, channels, connections, subagents, schedules, sandboxes, and evals with Vercel's agent runtime so teams can ship deployable agents without hand-rolling orchestration. The current beta fits Vercel-native backend agent projects.

Open Source

BrowserOS

Open-source agentic browser that runs local AI agents in your browsing workflow.

BrowserOS is a privacy-first, open-source agentic browser for running AI assistants locally inside real browsing sessions instead of handing every task to a remote cloud browser.

Open Source

Agent Governance Toolkit

Microsoft’s open-source toolkit for adding policy enforcement, identity, sandboxing, and audit controls to production AI agents.

Agent Governance Toolkit is an open-source Microsoft project for teams moving AI agents from demos into controlled production workflows. It focuses on runtime policy enforcement, zero-trust identity, sandboxed execution, and reliability patterns around autonomous agents, giving security and platform teams a governance layer around tool calls and agent actions rather than another prompt-only guardrail.

Open Source / Telemetry

Rampart

Microsoft’s pytest-native red teaming framework for turning AI agent safety findings into CI tests.

RAMPART is an open-source Microsoft framework for safety and security testing of agentic AI applications. It brings red-team findings into a pytest-native workflow so teams can turn prompt injection, unsafe tool use, and behavioral boundary failures into repeatable regression tests. The strongest aicoolies angle is developer workflow: RAMPART makes agent safety part of CI/CD instead of a one-off security review.

Open Source

OpenHuman

Local-first personal AI agent with memory trees, desktop integrations, and private workspace context.

OpenHuman is an open-source, local-first personal AI agent from TinyHumans. It combines a desktop app, persistent memory trees, Obsidian-compatible storage, OAuth integrations, and local model support into a private assistant harness. It is most interesting for users who want agentic workflows and long-term memory without handing every context detail to a fully cloud-hosted assistant.

Open Source / Telemetry

Unabyss

MCP-native personal context vault for keeping AI agents aligned with your work, voice, and projects.

Unabyss is a personal context headquarters for AI agents. It syncs sources such as email, Slack, Notion, Drive, meetings, and professional profiles into structured context files that can be served to MCP-capable clients. The strongest angle is not generic note taking; it is permissioned, reusable context for Claude, Cursor, custom agents, and other tools that otherwise need the same background explained repeatedly.

freemium / Telemetry