Judgeval is the open-source post-building layer for AI agents, built by Judgment Labs to solve the last-mile reliability problem teams hit once their agents are running in production. The Python SDK lets teams wrap any function with the @Tracer.observe() decorator to emit OpenTelemetry-compatible traces, so instrumentation slots into existing observability stacks without forcing teams onto a proprietary backend. Around that tracing core, the project layers a hosted evaluation engine with built-in scorers for faithfulness, answer relevancy, instruction adherence, and tool selection, alongside the option to register custom Judge classes that return binary, numeric, or categorical responses tuned to a team's specific quality bar.
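
To make the tracing-plus-scoring flow concrete, here is a minimal sketch of both halves. It is hedged rather than authoritative: the import paths and signatures shown (JudgmentClient, Example, FaithfulnessScorer, run_evaluation, the project_name and span_type parameters, and the judge model string) are assumptions extrapolated from the names above and may differ across SDK releases, so verify them against the installed version.

```python
# Hedged sketch of Judgeval's tracing and evaluation surface.
# Class and parameter names below are assumptions to check against
# the installed SDK release, not guaranteed signatures.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
from judgeval.tracer import Tracer

tracer = Tracer(project_name="support-agent")  # project name is illustrative

@tracer.observe(span_type="tool")  # each call emits an OTel-compatible span
def search_docs(query: str) -> str:
    # Stand-in for a real retrieval tool.
    return "Our refund window is 30 days."

client = JudgmentClient()
example = Example(
    input="What is the refund window?",
    actual_output="Refunds are accepted within 30 days.",
    retrieval_context=[search_docs("refund policy")],
)
client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.7)],  # built-in scorer named above
    model="gpt-4o",  # judge model choice is an assumption
)
```

The decorator handles the observability half (one span per call), while the evaluation call scores a captured input/output pair against the team's quality bar; a custom Judge class would slot into the scorers list the same way a built-in scorer does.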
What sets Judgeval apart from generic LLM observability tools is its post-training orientation. Captured production behavior is not only inspected after the fact: it can be replayed as evaluation datasets, exported as labeled traces for supervised fine-tuning, or used to reward and penalize trajectories during reinforcement learning runs (including GRPO-style pipelines). This makes Judgeval one of the few open-source projects that treat agent monitoring and agent training as a single closed loop, with the same primitives surfacing in tracing dashboards and post-training scripts. Native integrations with LangChain, LangGraph, and LlamaIndex mean most modern agent stacks can plug in without rewriting orchestration code.
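
The closed loop is easiest to see as a data flow: a scored production trace becomes a reward-weighted training tuple. The sketch below is illustrative only; the record fields (input, output, score) and the reward mapping are assumptions chosen for the example, not judgeval's exported trace schema or a documented API.

```python
# Illustrative only: the trace records and reward mapping below are
# assumptions, not judgeval's exported schema or a documented API.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str
    completion: str
    reward: float  # scalar reward consumed by a GRPO-style trainer

# Stand-in for traces exported from the tracing backend; the field
# names are hypothetical, chosen for this sketch.
exported_traces = [
    {"input": "Summarize the ticket", "output": "User cannot log in.", "score": 0.92},
    {"input": "Summarize the ticket", "output": "Unrelated rambling.", "score": 0.15},
]

def traces_to_rewards(traces, threshold=0.5):
    """Map scorer outputs onto rewards: above-threshold trajectories
    are reinforced, below-threshold ones are penalized."""
    examples = []
    for t in traces:
        reward = t["score"] if t["score"] >= threshold else t["score"] - 1.0
        examples.append(TrainingExample(t["input"], t["output"], reward))
    return examples

for ex in traces_to_rewards(exported_traces):
    print(ex)
```

Shifting sub-threshold rewards negative, rather than clipping them to zero, lets a GRPO-style trainer actively penalize bad trajectories instead of merely ignoring them; the same scorer that flagged the failure in the dashboard supplies the training signal.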
The licensing model is straightforward: the SDK and core platform are Apache 2.0 licensed and self-hostable, and the GitHub repository carries a public scorer library that teams can extend. Judgment Labs offers a managed cloud for teams that prefer not to run their own ingestion infrastructure, but the open-source path is fully featured rather than a stripped-down teaser. With over a thousand GitHub stars, daily commits, and active integrations across the agent framework ecosystem, Judgeval is a strong fit for teams that want Sentry-style production monitoring for their agents without surrendering ownership of their evaluation data or training pipelines.
