# llm-testing

11 tools tagged

RagaAI Catalyst

AI testing and evaluation for agents and LLM apps

RagaAI Catalyst is a Python SDK for observability, monitoring, and evaluation of LLM and agentic applications. It provides agent tracing with execution graph visualization, a self-hosted dashboard with analytics, synthetic data generation, a multi-metric evaluation framework, and guardrail management. It is built for teams running production RAG systems and AI agents who need systematic testing, debugging, and performance optimization workflows.

Open Source

Laminar

Open-source observability for AI agents

Laminar is an open-source observability platform for AI agents providing tracing, evaluation, and analytics for LLM applications. It integrates with Vercel AI SDK, LangChain, OpenAI, and Anthropic with a single line of code. Features include OpenTelemetry-native SDKs, an extensible evaluation framework with CI/CD support, SQL access to traces and metrics, and a visual debugging timeline for agent reasoning and actions.

Freemium, Open Source

SWE-bench

Benchmark for evaluating AI coding agents on real GitHub issues

SWE-bench is a benchmark from Princeton NLP that evaluates AI coding agents by testing their ability to resolve real GitHub issues from popular open-source projects. Each task provides an issue description and repository state, and the agent must produce a working patch that passes the project's test suite. With 4,600+ GitHub stars, it has become the standard yardstick for comparing autonomous coding tools like Devin, Claude Code, and OpenHands.

Open Source
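
The pass criterion in the SWE-bench entry can be sketched as a small check. SWE-bench task instances list the tests a patch must fix (`FAIL_TO_PASS`) and the tests it must not break (`PASS_TO_PASS`); the function name `is_resolved` and the shape of `test_results` are assumptions made for illustration, not the benchmark's own harness.

```python
from typing import Dict, List


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True when a patch counts as resolving the issue.

    test_results maps a test name to whether it passed after the patch
    was applied (hypothetical shape, chosen for this sketch).
    """
    # Every test that the fix should repair, plus every test that was
    # already passing, must pass after the patch.
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as a failure.
    return all(test_results.get(name, False) for name in required)
```

A patch that fixes the target test but breaks a previously passing one is not counted as resolved under this rule.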

DeepTeam

Open-source LLM red-teaming framework with 40+ attack types

DeepTeam is an open-source red-teaming framework for systematically testing LLM applications against 40+ adversarial attack types. It covers the OWASP Top 10 for LLMs, including jailbreaks, prompt injection, PII leakage, and hallucination attacks. It is the sister project of DeepEval, adding security testing alongside evaluation, and is licensed under Apache-2.0.

Open Source

Giskard

AI quality testing for bias, drift, and vulnerabilities

Giskard is an open-source testing framework for evaluating AI model quality and detecting bias, data drift, and security vulnerabilities. It provides automated test generation for LLMs and tabular models, scanning for issues like hallucination, prompt injection susceptibility, stereotypical outputs, and data leakage. It integrates with CI/CD pipelines for continuous model validation before deployment.

Freemium, Open Source

Confident AI

Evaluation-first LLM and agent observability

Confident AI is an evaluation-first observability platform that scores every trace and span with 50+ metrics, alerting on quality drops in LLM and agent applications. It goes beyond traditional APM by treating evaluation as core observability, providing actionable insights that help teams understand not just whether their AI applications are running but whether they are producing correct and useful outputs.

Freemium
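
The idea of alerting on quality drops, as described in the Confident AI entry, can be illustrated with a threshold check over recent evaluation scores. This is a generic sketch, not Confident AI's API: the function name `quality_dropped`, the `baseline` value, and the `tolerance` default are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def quality_dropped(
    recent_scores: Sequence[float],
    baseline: float,
    tolerance: float = 0.05,
) -> bool:
    """Flag a drop when the recent mean score falls below baseline - tolerance."""
    # With no recent scores there is nothing to compare, so do not alert.
    if not recent_scores:
        return False
    return mean(recent_scores) < baseline - tolerance
```

The tolerance keeps small fluctuations in per-trace scores from triggering an alert on every run.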

ps-fuzz

Prompt fuzzing tool for LLM security testing

ps-fuzz by Prompt Security is a security testing tool with 680+ GitHub stars that fuzzes system prompts against dynamic LLM-based attack scenarios including jailbreaks, prompt injection, and data extraction attempts. It helps developers harden their GenAI applications by simulating adversarial attacks in a controlled environment, turning LLM security into a testable and reproducible quality gate.

Open Source

OpenAI Evals

Framework for evaluating LLM and agent performance

OpenAI Evals is an open-source framework and benchmark registry for evaluating LLM performance on custom tasks. It provides infrastructure for writing evaluation prompts, running them against models, and recording results in a structured format for comparison. The hosted Evals API on the OpenAI platform adds managed run tracking, dataset management, and programmatic access to evaluation pipelines. With 17,700+ GitHub stars, it serves as a foundation for systematic LLM quality measurement.

Open Source
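
The workflow in the OpenAI Evals entry (prompts run against a model, results recorded in a structured format) can be sketched in plain Python. This does not use the Evals framework or the hosted API: `run_eval`, the `input`/`ideal` sample keys, and the exact-match scoring are assumptions for illustration, and `model` is any callable standing in for a real model client.

```python
import json
from typing import Callable, Dict, List


def run_eval(samples: List[Dict[str, str]], model: Callable[[str], str]) -> List[str]:
    """Run each sample through the model and return one JSON record per sample."""
    records = []
    for sample in samples:
        output = model(sample["input"])
        records.append(json.dumps({
            "input": sample["input"],
            "ideal": sample["ideal"],
            "output": output,
            # Exact match after trimming whitespace; real evals often
            # use fuzzier or model-graded comparisons.
            "correct": output.strip() == sample["ideal"].strip(),
        }))
    return records
```

Writing the returned strings one per line yields a JSONL file that can be diffed or aggregated across runs.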

DeepEval

Apache-2.0 Python framework for repeatable LLM, RAG, agent, MCP, and safety evaluation workflows

DeepEval is an Apache-2.0 Python framework for evaluating LLM apps, RAG systems, agents, MCP workflows, and safety behavior with repeatable test cases. It works locally and in CI/CD, then connects to Confident AI for hosted reports, observability, red teaming, and governance when teams need shared evidence instead of ad-hoc prompt reviews and manual QA.

Open Source

RAGAS

Evaluation framework for RAG pipelines

RAGAS is an Apache-2.0 open-source evaluation framework with 14K+ GitHub stars that provides standardized metrics for assessing RAG pipeline quality. It measures faithfulness, answer relevancy, context precision, and context recall to identify whether retrieval, generation, or both are failing. It is framework-agnostic, supports LLM-as-judge evaluation, and its README discloses minimal anonymized Open Analytics with a RAGAS_DO_NOT_TRACK opt-out.

Open Source, Telemetry
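
To make the retrieval-side metrics in the RAGAS entry concrete, here is a simplified set-based sketch of precision and recall over retrieved context chunks. It is not how RAGAS computes its metrics (those rely on LLM judgments); the function name `retrieval_precision_recall` is hypothetical. The snippet also sets the `RAGAS_DO_NOT_TRACK` opt-out named in the entry; the value `"true"` is an assumption.

```python
import os
from typing import Iterable, Tuple

# Opt out of RAGAS analytics before the library is imported.
# Variable name from the entry above; the value "true" is an assumption.
os.environ.setdefault("RAGAS_DO_NOT_TRACK", "true")


def retrieval_precision_recall(
    retrieved: Iterable[str],
    relevant: Iterable[str],
) -> Tuple[float, float]:
    """Return (precision, recall) of retrieved chunk IDs against relevant ones."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved chunks that were relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant chunks that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Low precision with high recall points at a noisy retriever, while low recall means relevant context never reached the generator.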

Promptfoo

LLM testing and evaluation toolkit

Promptfoo is an OpenAI-owned open-source toolkit for evaluating, red-teaming, and securing LLM applications. It supports config-driven prompt and model tests, CI regression gates, red-team scans, guardrails, model security workflows, MCP Proxy, code scanning, and evaluations across prompts, agents, and RAG pipelines.

Open Source