Open-source AI observability platform for LLM tracing and evaluation
Phoenix by Arize is an open-source AI observability platform for tracing, evaluating, and debugging LLM applications. It captures prompt-response pairs, retrieval context, agent tool calls, and latency data through OpenTelemetry-based instrumentation, and it provides experiment tracking, dataset management, and evaluation frameworks for systematically improving AI application quality. The project has over 9,200 GitHub stars.
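To make the tracing idea concrete, here is a stdlib-only sketch of the kind of span record an OpenTelemetry-based tracer captures around an LLM call. The field names and the `traced_llm_call` wrapper are hypothetical illustrations, not Phoenix's actual instrumentation or schema.

```python
import time

def traced_llm_call(llm_fn, prompt, retrieval_context=None):
    """Wrap an LLM call and record span-like data (prompt, response,
    retrieval context, latency) of the sort an OTel-based tracer captures.
    Hypothetical sketch; not Phoenix's real API or attribute names."""
    start = time.monotonic()
    response = llm_fn(prompt)
    span = {
        "name": "llm.completion",
        "attributes": {
            "llm.prompt": prompt,
            "llm.response": response,
            "retrieval.context": retrieval_context,
        },
        "latency_ms": (time.monotonic() - start) * 1000,
    }
    return response, span

# Stub function standing in for a real LLM client call.
response, span = traced_llm_call(lambda p: f"echo: {p}", "What is tracing?")
print(span["name"], span["latency_ms"] >= 0)
```

In a real deployment the span would be exported to a collector rather than returned, but the captured attributes are the same kind of data the description above lists.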
Benchmark for evaluating AI coding agents on real GitHub issues
SWE-bench is a benchmark from Princeton NLP that evaluates AI coding agents by testing their ability to resolve real GitHub issues from popular open-source projects. Each task provides an issue description and repository state, and the agent must produce a working patch that passes the project's test suite. With 4,600+ GitHub stars, it has become the standard yardstick for comparing autonomous coding tools like Devin, Claude Code, and OpenHands.
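The evaluation contract can be sketched as follows. The record below mirrors the shape of a SWE-bench task instance (an issue description plus lists of tests that must flip from failing to passing), but the concrete values and the `is_resolved` helper are invented for illustration.

```python
# Hypothetical SWE-bench-style task instance; field names follow the
# dataset's schema, values are made up for illustration.
task = {
    "instance_id": "example__repo-1234",
    "repo": "example/repo",
    "base_commit": "abc123",
    "problem_statement": "Calling foo() with an empty list raises IndexError.",
    "FAIL_TO_PASS": ["tests/test_foo.py::test_empty_list"],
    "PASS_TO_PASS": ["tests/test_foo.py::test_basic"],
}

def is_resolved(test_results):
    """A candidate patch resolves the task only if the originally failing
    tests now pass AND the originally passing tests still pass."""
    required = task["FAIL_TO_PASS"] + task["PASS_TO_PASS"]
    return all(test_results.get(t, False) for t in required)

print(is_resolved({"tests/test_foo.py::test_empty_list": True,
                   "tests/test_foo.py::test_basic": True}))
```

The dual condition is the point of the benchmark: a patch that fixes the issue but breaks existing behavior does not count as resolved.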
Python toolkit for assessing and mitigating ML model fairness issues
Fairlearn is a Microsoft-backed open-source Python toolkit that helps developers assess and improve the fairness of machine learning models. It provides metrics for measuring disparity across groups defined by sensitive features, mitigation algorithms that reduce unfairness while maintaining model performance, and an interactive visualization dashboard for exploring fairness-accuracy trade-offs. It integrates with scikit-learn and with Azure ML's Responsible AI dashboard.
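As an example of the disparity metrics mentioned above, here is a stdlib-only sketch of the demographic parity difference: the largest gap in selection rate (mean positive prediction) between any two groups. Fairlearn exposes this statistic through its own metrics API; the function below only illustrates what the number measures.

```python
from collections import defaultdict

def demographic_parity_difference(y_pred, sensitive):
    """Largest gap in selection rate between groups defined by a
    sensitive feature. Illustrative reimplementation of the statistic;
    not Fairlearn's actual code."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, sensitive):
        totals[group] += 1
        positives[group] += pred
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)

y_pred    = [1, 0, 1, 1, 0, 0, 1, 0]
sensitive = ["a", "a", "a", "a", "b", "b", "b", "b"]
# Group "a" is selected at rate 0.75, group "b" at 0.25.
print(demographic_parity_difference(y_pred, sensitive))  # 0.5
```

A value of 0 means both groups receive positive predictions at the same rate; Fairlearn's mitigation algorithms aim to shrink gaps like this while preserving accuracy.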
OpenTelemetry-native observability for LLM applications with evals and GPU monitoring
OpenLIT is an open-source AI engineering platform that provides OpenTelemetry-native observability for LLM applications. It combines distributed tracing, evaluation, prompt management, a secrets vault, and GPU telemetry in a single self-hostable stack. With 50+ integrations across LLM providers and frameworks, it lets teams monitor AI applications using their existing observability backends like Grafana, Datadog, or Jaeger.
Open-source LLMOps platform for prompt management and evaluation
Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ LLM models with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.
Lightweight eval library for LLM applications
OpenEvals is a lightweight evaluation library from the LangChain team for testing LLM application quality using LLM-as-judge patterns. It provides pre-built prompt sets and evaluation functions that score model outputs against criteria like accuracy, relevance, coherence, and safety without requiring complex infrastructure. Available as both Python and JavaScript packages, OpenEvals complements OpenAI Evals with a simpler, framework-agnostic approach to quality measurement in agentic workflows.
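The LLM-as-judge pattern the description refers to can be sketched in a few lines: render a grading prompt around the output under test, ask a judge model for a verdict, and map the verdict to a score. The function, template, and stub judge below are hypothetical illustrations of the pattern, not OpenEvals' actual API.

```python
def llm_as_judge(judge, prompt_template, output, criterion):
    """Minimal LLM-as-judge loop: build a grading prompt, query the
    judge model, map its verdict to a numeric score. Illustrative only."""
    grading_prompt = prompt_template.format(output=output, criterion=criterion)
    verdict = judge(grading_prompt)  # expected to answer "PASS" or "FAIL"
    return {"criterion": criterion, "score": 1.0 if verdict == "PASS" else 0.0}

TEMPLATE = ("Does the answer below satisfy the criterion "
            "'{criterion}'? Reply PASS or FAIL.\n\n{output}")

# Stub judge standing in for a real chat-model call.
stub_judge = lambda p: "PASS" if "Paris" in p else "FAIL"

result = llm_as_judge(stub_judge, TEMPLATE,
                      "The capital of France is Paris.", "accuracy")
print(result)  # {'criterion': 'accuracy', 'score': 1.0}
```

In practice the judge is a strong chat model and the template is one of the library's pre-built criterion prompts; the control flow stays this simple, which is the library's selling point.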
Framework for evaluating LLM and agent performance
OpenAI Evals is an open-source framework and benchmark registry for evaluating LLM performance on custom tasks. It provides infrastructure for writing evaluation prompts, running them against models, and recording results in a structured format for comparison. The hosted Evals API on the OpenAI platform adds managed run tracking, dataset management, and programmatic access to evaluation pipelines. With 17,700+ GitHub stars, it serves as a foundation for systematic LLM quality measurement.
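To show what "writing evaluation prompts and recording results in a structured format" looks like, here is a sketch of one record in the `samples.jsonl` shape used by the framework's basic match-style evals (a chat-formatted `input` plus an `ideal` answer), together with a simplified exact-match grader. The grader function is a stand-in, not the framework's own implementation.

```python
import json

# One record in the samples.jsonl shape consumed by basic match-style
# evals: a chat-formatted input plus the ideal completion.
sample = {
    "input": [
        {"role": "system", "content": "Answer with a single word."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "ideal": "Paris",
}
line = json.dumps(sample)  # one line of samples.jsonl

def exact_match(completion, record):
    """Simplified stand-in for a match grader: score the model's
    completion against the record's ideal answer."""
    return completion.strip() == record["ideal"]

print(exact_match("Paris", json.loads(line)))  # True
```

Running an eval then amounts to streaming such records through a model and aggregating the per-record scores into the structured results the framework logs.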
Build and evaluate LLM apps end-to-end
Prompt Flow is Microsoft's open-source development suite for building, testing, evaluating, and deploying LLM-based applications end-to-end. It links LLM calls, prompts, Python code, and other tools into executable flows defined in YAML, with a VS Code extension providing a visual flow designer. The tool supports tracing LLM interactions for debugging, running batch evaluations with quality metrics against larger datasets, and integrating tests into CI/CD pipelines before production deployment.
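A flow of the kind described above is defined in a `flow.dag.yaml` file. The fragment below is a hedged sketch of that format, wiring one LLM node from a flow input to a flow output; the node name, template path, and exact schema details are illustrative and may differ from what the VS Code extension generates.

```yaml
# Hypothetical flow.dag.yaml sketch (names and paths are illustrative).
inputs:
  question:
    type: string
outputs:
  answer:
    type: string
    reference: ${answer_question.output}
nodes:
- name: answer_question
  type: llm
  source:
    type: code
    path: answer_question.jinja2   # prompt template rendered for the LLM call
  inputs:
    question: ${inputs.question}
```

Additional nodes (Python tools, more LLM calls) chain together through the same `${node.output}` references, which is how a flow links prompts and code into one executable DAG.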