Open-source AI observability platform for LLM tracing and evaluation
Phoenix by Arize is an open-source AI observability platform for tracing, evaluating, and debugging LLM applications. It captures prompt-response pairs, retrieval context, agent tool calls, and latency data through OpenTelemetry-based instrumentation, and it provides experiment tracking, dataset management, and evaluation frameworks for systematically improving AI application quality. The project has over 9,200 GitHub stars.
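To make the tracing idea concrete, here is a stdlib-only sketch of the kind of span record an OpenTelemetry-based tracer captures around an LLM call. The field names and the `traced_llm_call` wrapper are hypothetical illustrations, not Phoenix's actual instrumentation or schema.

```python
import time

def traced_llm_call(llm_fn, prompt, retrieval_context=None):
    """Wrap an LLM call and record span-like data (prompt, response,
    retrieval context, latency) of the sort an OTel-based tracer captures.
    Hypothetical sketch; not Phoenix's real API or attribute names."""
    start = time.monotonic()
    response = llm_fn(prompt)
    span = {
        "name": "llm.completion",
        "attributes": {
            "llm.prompt": prompt,
            "llm.response": response,
            "retrieval.context": retrieval_context,
        },
        "latency_ms": (time.monotonic() - start) * 1000,
    }
    return response, span

# Stub function standing in for a real LLM client call.
response, span = traced_llm_call(lambda p: f"echo: {p}", "What is tracing?")
print(span["name"], span["latency_ms"] >= 0)
```

In a real deployment the span would be exported to a collector rather than returned, but the captured attributes are the same kind of data the description above lists.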
Benchmark for evaluating AI coding agents on real GitHub issues
SWE-bench is a benchmark from Princeton NLP that evaluates AI coding agents by testing their ability to resolve real GitHub issues from popular open-source projects. Each task provides an issue description and repository state, and the agent must produce a working patch that passes the project's test suite. With 4,600+ GitHub stars, it has become the standard yardstick for comparing autonomous coding tools like Devin, Claude Code, and OpenHands.
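The evaluation contract can be sketched as follows. The record below mirrors the shape of a SWE-bench task instance (an issue description plus lists of tests that must flip from failing to passing), but the concrete values and the `is_resolved` helper are invented for illustration.

```python
# Hypothetical SWE-bench-style task instance; field names follow the
# dataset's schema, values are made up for illustration.
task = {
    "instance_id": "example__repo-1234",
    "repo": "example/repo",
    "base_commit": "abc123",
    "problem_statement": "Calling foo() with an empty list raises IndexError.",
    "FAIL_TO_PASS": ["tests/test_foo.py::test_empty_list"],
    "PASS_TO_PASS": ["tests/test_foo.py::test_basic"],
}

def is_resolved(test_results):
    """A candidate patch resolves the task only if the originally failing
    tests now pass AND the originally passing tests still pass."""
    required = task["FAIL_TO_PASS"] + task["PASS_TO_PASS"]
    return all(test_results.get(t, False) for t in required)

print(is_resolved({"tests/test_foo.py::test_empty_list": True,
                   "tests/test_foo.py::test_basic": True}))
```

The dual condition is the point of the benchmark: a patch that fixes the issue but breaks existing behavior does not count as resolved.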
Python toolkit for assessing and mitigating ML model fairness issues
Fairlearn is a Microsoft-backed open-source Python toolkit that helps developers assess and improve the fairness of machine learning models. It provides metrics for measuring disparity across groups defined by sensitive features, mitigation algorithms that reduce unfairness while maintaining model performance, and an interactive visualization dashboard for exploring fairness-accuracy trade-offs. It integrates with scikit-learn and with Azure ML's Responsible AI dashboard.
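As an example of the disparity metrics mentioned above, here is a stdlib-only sketch of the demographic parity difference: the largest gap in selection rate (mean positive prediction) between any two groups. Fairlearn exposes this statistic through its own metrics API; the function below only illustrates what the number measures.

```python
from collections import defaultdict

def demographic_parity_difference(y_pred, sensitive):
    """Largest gap in selection rate between groups defined by a
    sensitive feature. Illustrative reimplementation of the statistic;
    not Fairlearn's actual code."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, sensitive):
        totals[group] += 1
        positives[group] += pred
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)

y_pred    = [1, 0, 1, 1, 0, 0, 1, 0]
sensitive = ["a", "a", "a", "a", "b", "b", "b", "b"]
# Group "a" is selected at rate 0.75, group "b" at 0.25.
print(demographic_parity_difference(y_pred, sensitive))  # 0.5
```

A value of 0 means both groups receive positive predictions at the same rate; Fairlearn's mitigation algorithms aim to shrink gaps like this while preserving accuracy.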
OpenTelemetry-native observability for LLM applications with evals and GPU monitoring
OpenLIT is an open-source AI engineering platform that provides OpenTelemetry-native observability for LLM applications. It combines distributed tracing, evaluation, prompt management, a secrets vault, and GPU telemetry in a single self-hostable stack. With 50+ integrations across LLM providers and frameworks, it lets teams monitor AI applications using their existing observability backends like Grafana, Datadog, or Jaeger.
Open-source LLMOps platform for prompt management and evaluation
Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ LLM models with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.
Lightweight eval library for LLM applications
OpenEvals is a lightweight evaluation library from the LangChain team for testing LLM application quality using LLM-as-judge patterns. It provides pre-built prompt sets and evaluation functions that score model outputs against criteria like accuracy, relevance, coherence, and safety without requiring complex infrastructure. Available as both Python and JavaScript packages, OpenEvals complements OpenAI Evals with a simpler, framework-agnostic approach to quality measurement in agentic workflows.
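The LLM-as-judge pattern the description refers to can be sketched in a few lines: render a grading prompt around the output under test, ask a judge model for a verdict, and map the verdict to a score. The function, template, and stub judge below are hypothetical illustrations of the pattern, not OpenEvals' actual API.

```python
def llm_as_judge(judge, prompt_template, output, criterion):
    """Minimal LLM-as-judge loop: build a grading prompt, query the
    judge model, map its verdict to a numeric score. Illustrative only."""
    grading_prompt = prompt_template.format(output=output, criterion=criterion)
    verdict = judge(grading_prompt)  # expected to answer "PASS" or "FAIL"
    return {"criterion": criterion, "score": 1.0 if verdict == "PASS" else 0.0}

TEMPLATE = ("Does the answer below satisfy the criterion "
            "'{criterion}'? Reply PASS or FAIL.\n\n{output}")

# Stub judge standing in for a real chat-model call.
stub_judge = lambda p: "PASS" if "Paris" in p else "FAIL"

result = llm_as_judge(stub_judge, TEMPLATE,
                      "The capital of France is Paris.", "accuracy")
print(result)  # {'criterion': 'accuracy', 'score': 1.0}
```

In practice the judge is a strong chat model and the template is one of the library's pre-built criterion prompts; the control flow stays this simple, which is the library's selling point.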
Framework for evaluating LLM and agent performance
OpenAI Evals is an open-source framework and benchmark registry for evaluating LLM performance on custom tasks. It provides infrastructure for writing evaluation prompts, running them against models, and recording results in a structured format for comparison. The hosted Evals API on the OpenAI platform adds managed run tracking, dataset management, and programmatic access to evaluation pipelines. With 17,700+ GitHub stars, it serves as a foundation for systematic LLM quality measurement.
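To show what "writing evaluation prompts and recording results in a structured format" looks like, here is a sketch of one record in the `samples.jsonl` shape used by the framework's basic match-style evals (a chat-formatted `input` plus an `ideal` answer), together with a simplified exact-match grader. The grader function is a stand-in, not the framework's own implementation.

```python
import json

# One record in the samples.jsonl shape consumed by basic match-style
# evals: a chat-formatted input plus the ideal completion.
sample = {
    "input": [
        {"role": "system", "content": "Answer with a single word."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "ideal": "Paris",
}
line = json.dumps(sample)  # one line of samples.jsonl

def exact_match(completion, record):
    """Simplified stand-in for a match grader: score the model's
    completion against the record's ideal answer."""
    return completion.strip() == record["ideal"]

print(exact_match("Paris", json.loads(line)))  # True
```

Running an eval then amounts to streaming such records through a model and aggregating the per-record scores into the structured results the framework logs.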
Build and evaluate LLM apps end-to-end
Prompt Flow is Microsoft's open-source development suite for building, testing, evaluating, and deploying LLM-based applications end-to-end. It links LLM calls, prompts, Python code, and other tools into executable flows defined in YAML, with a VS Code extension providing a visual flow designer. The tool supports tracing LLM interactions for debugging, running batch evaluations with quality metrics against larger datasets, and integrating tests into CI/CD pipelines before production deployment.
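A flow of the kind described above is defined in a `flow.dag.yaml` file. The fragment below is a hedged sketch of that format, wiring one LLM node from a flow input to a flow output; the node name, template path, and exact schema details are illustrative and may differ from what the VS Code extension generates.

```yaml
# Hypothetical flow.dag.yaml sketch (names and paths are illustrative).
inputs:
  question:
    type: string
outputs:
  answer:
    type: string
    reference: ${answer_question.output}
nodes:
- name: answer_question
  type: llm
  source:
    type: code
    path: answer_question.jinja2   # prompt template rendered for the LLM call
  inputs:
    question: ${inputs.question}
```

Additional nodes (Python tools, more LLM calls) chain together through the same `${node.output}` references, which is how a flow links prompts and code into one executable DAG.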