Judgeval is the open-source post-building layer for AI agents, built by Judgment Labs to solve the last-mile reliability problem teams hit once their agents are running in production. The Python SDK lets teams wrap any function with the @Tracer.observe() decorator to emit OpenTelemetry-compatible traces, so instrumentation slots into existing observability stacks without forcing teams onto a proprietary backend. Around that tracing core, the project layers a hosted evaluation engine with built-in scorers for faithfulness, answer relevancy, instruction adherence, and tool selection, alongside the option to register custom Judge classes that return binary, numeric, or categorical responses tuned to a team's specific quality bar.
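
To make the tracing-plus-scoring flow concrete, here is a minimal sketch of both halves. It is hedged rather than authoritative: the import paths and signatures shown (JudgmentClient, Example, FaithfulnessScorer, run_evaluation, the project_name and span_type parameters, and the judge model string) are assumptions extrapolated from the names above and may differ across SDK releases, so verify them against the installed version.

```python
# Hedged sketch of Judgeval's tracing and evaluation surface.
# Class and parameter names below are assumptions to check against
# the installed SDK release, not guaranteed signatures.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
from judgeval.tracer import Tracer

tracer = Tracer(project_name="support-agent")  # project name is illustrative

@tracer.observe(span_type="tool")  # each call emits an OTel-compatible span
def search_docs(query: str) -> str:
    # Stand-in for a real retrieval tool.
    return "Our refund window is 30 days."

client = JudgmentClient()
example = Example(
    input="What is the refund window?",
    actual_output="Refunds are accepted within 30 days.",
    retrieval_context=[search_docs("refund policy")],
)
client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.7)],  # built-in scorer named above
    model="gpt-4o",  # judge model choice is an assumption
)
```

The decorator handles the observability half (one span per call), while the evaluation call scores a captured input/output pair against the team's quality bar; a custom Judge class would slot into the scorers list the same way a built-in scorer does.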
What sets Judgeval apart from generic LLM observability tools is its post-training orientation. Captured production behavior is not only inspected after the fact: it can be replayed as evaluation datasets, exported as labeled traces for supervised fine-tuning, or used to reward and penalize trajectories during reinforcement learning runs (including GRPO-style pipelines). This makes Judgeval one of the few open-source projects that treat agent monitoring and agent training as a single closed loop, with the same primitives surfacing in tracing dashboards and post-training scripts. Native integrations with LangChain, LangGraph, and LlamaIndex mean most modern agent stacks can plug in without rewriting orchestration code.
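
The closed loop is easiest to see as a data flow: a scored production trace becomes a reward-weighted training tuple. The sketch below is illustrative only; the record fields (input, output, score) and the reward mapping are assumptions chosen for the example, not judgeval's exported trace schema or a documented API.

```python
# Illustrative only: the trace records and reward mapping below are
# assumptions, not judgeval's exported schema or a documented API.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str
    completion: str
    reward: float  # scalar reward consumed by a GRPO-style trainer

# Stand-in for traces exported from the tracing backend; the field
# names are hypothetical, chosen for this sketch.
exported_traces = [
    {"input": "Summarize the ticket", "output": "User cannot log in.", "score": 0.92},
    {"input": "Summarize the ticket", "output": "Unrelated rambling.", "score": 0.15},
]

def traces_to_rewards(traces, threshold=0.5):
    """Map scorer outputs onto rewards: above-threshold trajectories
    are reinforced, below-threshold ones are penalized."""
    examples = []
    for t in traces:
        reward = t["score"] if t["score"] >= threshold else t["score"] - 1.0
        examples.append(TrainingExample(t["input"], t["output"], reward))
    return examples

for ex in traces_to_rewards(exported_traces):
    print(ex)
```

Shifting sub-threshold rewards negative, rather than clipping them to zero, lets a GRPO-style trainer actively penalize bad trajectories instead of merely ignoring them; the same scorer that flagged the failure in the dashboard supplies the training signal.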
The licensing model is straightforward: the SDK and core platform are Apache 2.0 licensed and self-hostable, and the GitHub repository carries a public scorer library that teams can extend. Judgment Labs offers a managed cloud for teams that prefer not to run their own ingestion infrastructure, but the open-source path is fully featured rather than a stripped-down teaser. With over a thousand GitHub stars, daily commits, and active integrations across the agent framework ecosystem, Judgeval is a strong fit for teams that want Sentry-style production monitoring for their agents without surrendering ownership of their evaluation data or training pipelines.
