6 tools tagged
Showing 6 of 6 tools
Benchmark for evaluating AI coding agents on real GitHub issues
SWE-bench is a benchmark from Princeton NLP that evaluates AI coding agents by testing their ability to resolve real GitHub issues from popular open-source projects. Each task provides an issue description and repository state, and the agent must produce a working patch that passes the project's test suite. With 4,600+ GitHub stars, it has become the standard yardstick for comparing autonomous coding tools like Devin, Claude Code, and OpenHands.
Open-source LLM red-teaming framework with 40+ attack types
DeepTeam is an open-source red-teaming framework for systematically testing LLM applications against 40+ adversarial attack types. It covers OWASP Top 10 for LLMs including jailbreaks, prompt injection, PII leakage, and hallucination attacks. Built as the sister project of DeepEval for security testing alongside evaluation. Apache-2.0 licensed.
Evaluation-first LLM and agent observability
Confident AI is an evaluation-first observability platform that scores every trace and span with 50+ metrics, alerting on quality drops in LLM and agent applications. It goes beyond traditional APM by treating evaluation as core observability, providing actionable insights that help teams understand not just whether their AI applications are running but whether they are producing correct and useful outputs.
Prompt fuzzing tool for LLM security testing
ps-fuzz by Prompt Security is a security testing tool with 660+ GitHub stars that fuzzes system prompts against dynamic LLM-based attack scenarios including jailbreaks, prompt injection, and data extraction attempts. It helps developers harden their GenAI applications by simulating adversarial attacks in a controlled environment, turning LLM security into a testable and reproducible quality gate.
Framework for evaluating LLM and agent performance
OpenAI Evals is an open-source framework and benchmark registry for evaluating LLM performance on custom tasks. It provides infrastructure for writing evaluation prompts, running them against models, and recording results in a structured format for comparison. The hosted Evals API on the OpenAI platform adds managed run tracking, dataset management, and programmatic access to evaluation pipelines. With 17,700+ GitHub stars, it serves as a foundation for systematic LLM quality measurement.
LLM testing and evaluation toolkit
Open-source agent framework for building autonomous AI systems with tool use, memory, and multi-step reasoning. Provides abstractions for creating agents that can plan, execute, and learn from interactions. Supports multiple LLM backends and integrates with popular vector stores and tool libraries, making it a flexible foundation for custom AI agent development.