11 tools tagged
Showing 11 of 11 tools
AI testing and evaluation for agents and LLM apps
RagaAI Catalyst is a comprehensive Python SDK for observability, monitoring, and evaluation of LLM and agentic applications. Provides agent tracing with execution graph visualization, self-hosted dashboard with analytics, synthetic data generation, multi-metric evaluation framework, and guardrail management. Built for teams running production RAG systems and AI agents who need systematic testing, debugging, and performance optimization workflows.
Open-source observability for AI agents
Laminar is an open-source observability platform for AI agents providing tracing, evaluation, and analytics for LLM applications. It integrates with Vercel AI SDK, LangChain, OpenAI, and Anthropic with a single line of code. Features include OpenTelemetry-native SDKs, an extensible evaluation framework with CI/CD support, SQL access to traces and metrics, and a visual debugging timeline for agent reasoning and actions.
Benchmark for evaluating AI coding agents on real GitHub issues
SWE-bench is a benchmark from Princeton NLP that evaluates AI coding agents by testing their ability to resolve real GitHub issues from popular open-source projects. Each task provides an issue description and repository state, and the agent must produce a working patch that passes the project's test suite. With 4,600+ GitHub stars, it has become the standard yardstick for comparing autonomous coding tools like Devin, Claude Code, and OpenHands.
Open-source LLM red-teaming framework with 40+ attack types
DeepTeam is an open-source red-teaming framework for systematically testing LLM applications against 40+ adversarial attack types. It covers OWASP Top 10 for LLMs including jailbreaks, prompt injection, PII leakage, and hallucination attacks. Built as the sister project of DeepEval for security testing alongside evaluation. Apache-2.0 licensed.
AI quality testing for bias, drift, and vulnerabilities
Giskard is an open-source testing framework for evaluating AI model quality, detecting bias, data drift, and security vulnerabilities. It provides automated test generation for LLMs and tabular models, scanning for issues like hallucination, prompt injection susceptibility, stereotypical outputs, and data leakage. Integrates with CI/CD pipelines for continuous model validation before deployment.
Evaluation-first LLM and agent observability
Confident AI is an evaluation-first observability platform that scores every trace and span with 50+ metrics, alerting on quality drops in LLM and agent applications. It goes beyond traditional APM by treating evaluation as core observability, providing actionable insights that help teams understand not just whether their AI applications are running but whether they are producing correct and useful outputs.
Prompt fuzzing tool for LLM security testing
ps-fuzz by Prompt Security is a security testing tool with 660+ GitHub stars that fuzzes system prompts against dynamic LLM-based attack scenarios including jailbreaks, prompt injection, and data extraction attempts. It helps developers harden their GenAI applications by simulating adversarial attacks in a controlled environment, turning LLM security into a testable and reproducible quality gate.
Framework for evaluating LLM and agent performance
OpenAI Evals is an open-source framework and benchmark registry for evaluating LLM performance on custom tasks. It provides infrastructure for writing evaluation prompts, running them against models, and recording results in a structured format for comparison. The hosted Evals API on the OpenAI platform adds managed run tracking, dataset management, and programmatic access to evaluation pipelines. With 17,700+ GitHub stars, it serves as a foundation for systematic LLM quality measurement.
Unit testing framework for LLM applications
DeepEval is an open-source LLM unit testing framework with 4K+ GitHub stars that brings pytest-like syntax to AI application testing. Provides 14+ evaluation metrics including faithfulness, hallucination, bias, toxicity, and answer relevancy with LLM-as-judge scoring. Tests run locally with any LLM provider. Features synthetic dataset generation, regression testing, and CI/CD integration. Write test cases with familiar assert patterns to catch quality regressions before deployment.
Evaluation framework for RAG pipelines
RAGAS is an open-source evaluation framework with 8K+ GitHub stars that provides standardized metrics for assessing RAG pipeline quality. Measures faithfulness, answer relevancy, context precision, and context recall to identify exactly where a RAG system fails — retrieval, generation, or both. Framework-agnostic with support for any LLM as evaluator. Integrates with LangChain, LlamaIndex, and CI/CD pipelines for automated regression testing of RAG applications.
LLM testing and evaluation toolkit
Open-source tool for testing, evaluating, and red-teaming LLM applications. Promptfoo lets developers define test cases, run prompts across multiple models and configurations, and score outputs with built-in metrics like factuality, relevance, and toxicity. Includes red-teaming for jailbreak and hallucination detection plus CI/CD integration for automated prompt regression testing.