
SWE-bench

Benchmark for evaluating AI coding agents on real GitHub issues


SWE-bench is a benchmark from Princeton NLP that evaluates AI coding agents by testing their ability to resolve real GitHub issues from popular open-source projects. Each task provides an issue description and repository state, and the agent must produce a working patch that passes the project's test suite. With 4,600+ GitHub stars, it has become the standard yardstick for comparing autonomous coding tools like Devin, Claude Code, and OpenHands.

SWE-bench is a benchmark dataset and evaluation framework created by researchers at Princeton's NLP group that measures how well AI systems solve real-world software engineering tasks. Unlike synthetic coding benchmarks that test isolated function generation, SWE-bench draws on actual GitHub issues from twelve popular Python repositories, including Django, Flask, scikit-learn, and matplotlib. Each task consists of a natural language issue description, a snapshot of the repository from before the fix was merged, and a set of tests that check whether the proposed patch resolves the problem.
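The task structure described above can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark's own code: the class, the field names, and the sample values are all hypothetical, and it assumes that a task counts as resolved only when the tests targeting the issue pass and the previously passing tests still pass.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical, simplified view of one benchmark task."""
    issue_text: str    # natural language issue description
    repo: str          # repository the issue belongs to
    base_commit: str   # repository snapshot the agent starts from
    # Tests that fail before the fix and must pass after it (assumed split).
    fail_to_pass: list[str] = field(default_factory=list)
    # Tests that already pass and must keep passing (assumed split).
    pass_to_pass: list[str] = field(default_factory=list)


def is_resolved(task: TaskInstance, test_results: dict[str, bool]) -> bool:
    """Return True if every required test passed after applying the patch.

    test_results maps a test name to True (passed) or False (failed).
    A required test that is missing from the results counts as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(test_results.get(name, False) for name in required)


# Made-up example data, for illustration only.
task = TaskInstance(
    issue_text="Counting an empty result set raises an exception",
    repo="example/project",
    base_commit="abc123",
    fail_to_pass=["tests/test_count.py::test_empty"],
    pass_to_pass=["tests/test_count.py::test_basic"],
)

results = {
    "tests/test_count.py::test_empty": True,
    "tests/test_count.py::test_basic": True,
}
print(is_resolved(task, results))  # prints True
```

Treating a missing test result as a failure is a conservative choice in this sketch: a patch that breaks test collection should not be scored as a success.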

The benchmark has become the de facto standard for evaluating autonomous coding agents. When companies like Cognition demonstrate Devin, or when Anthropic reports Claude Code's capabilities, or when OpenHands publishes agent performance data, they reference SWE-bench scores. The evaluation uses Docker-based reproducible environments to ensure that test results are consistent across different hardware and software configurations. SWE-bench Verified is a curated subset of 500 tasks reviewed by software engineers to confirm that each task is unambiguous and solvable.
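Scores of this kind are usually quoted as the percentage of tasks an agent resolves. A minimal sketch of that arithmetic follows; the function name and the task identifiers are hypothetical, and the per-task outcomes are assumed to come from an evaluation run.

```python
def resolved_rate(outcomes: dict[str, bool]) -> float:
    """Return the percentage of tasks marked as resolved.

    outcomes maps a task identifier to True (resolved) or False (not resolved).
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes.values()) / len(outcomes)


# Made-up outcomes for four hypothetical tasks.
outcomes = {
    "task-001": True,
    "task-002": False,
    "task-003": False,
    "task-004": False,
}
print(resolved_rate(outcomes))  # prints 25.0
```

Because the denominator is the full task list, unattempted tasks should be recorded as False rather than omitted. Otherwise the rate is inflated.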

Beyond raw benchmarking, SWE-bench has influenced how the industry thinks about AI coding evaluation. It demonstrated that generating syntactically correct code is insufficient: agents must understand project architecture, navigate large codebases, identify relevant files, and produce patches that pass the project's existing tests. The benchmark is MIT licensed, has over 4,600 GitHub stars, and continues to be maintained with new evaluation variants. For teams building or selecting AI coding tools, SWE-bench provides a widely accepted, reproducible metric for comparing agent capabilities on realistic software engineering work.

Pricing

Free and open-source under MIT license

Platforms

Python, Docker (evaluation framework)

Alternatives


DeepEval

Apache-2.0 Python framework for repeatable LLM, RAG, agent, MCP, and safety evaluation workflows.

DeepEval is an Apache-2.0 Python framework for evaluating LLM apps, RAG systems, agents, MCP workflows, and safety behavior with repeatable test cases. It works locally and in CI/CD, then connects to Confident AI for hosted reports, observability, red teaming, and governance when teams need shared evidence instead of ad-hoc prompt reviews and manual QA.

open-source

Promptfoo

LLM testing and evaluation toolkit

Promptfoo is an OpenAI-owned open-source toolkit for evaluating, red-teaming and securing LLM applications. It supports config-driven prompt/model tests, CI regression gates, red-team scans, guardrails, model security workflows, MCP Proxy, code scanning and evaluations across prompts, agents and RAG pipelines.

open-source

TruLens

LLM evaluation and tracking with RAG triad metrics

TruLens is an open-source, MIT-licensed framework for evaluating and tracking LLM experiments with feedback functions, RAG triad metrics (answer relevance, context relevance, groundedness), and Honest/Harmless/Helpful evaluations. It features a unified Metric API for systematic evaluation of RAG pipelines and AI agents, and supports LangChain, LlamaIndex, and custom LLM applications. The project has 3,200+ GitHub stars, and a Snowflake partnership adds enterprise integration.

open-source

Braintrust

LLM evaluation and prompt engineering platform

Braintrust is an AI observability and evaluation platform for tracing LLM applications, building datasets, running prompt/model experiments, scoring outputs and turning production feedback into regression tests. It fits teams that need repeatable quality gates for AI releases rather than one-off prompt demos.

freemium

Related Tools


Claude Code

Top Pick

Anthropic's agentic coding CLI

Anthropic's agentic CLI coding tool that delegates complex tasks to Claude directly from the terminal. Understands entire codebases via automatic context gathering, edits multiple files, runs shell commands, and manages Git workflows autonomously. Supports CLAUDE.md for persistent project instructions, integrates with VS Code and JetBrains, and uses Claude Opus/Sonnet with extended thinking for complex architectural decisions. Built for terminal-first developers.

paid, Open Source

Cursor

Top Pick

The AI-first code editor

AI-first code editor built as a VS Code fork that deeply integrates LLMs into every part of the development workflow. Features Tab autocomplete with multi-line predictions, Cmd+K inline editing, AI chat with full codebase awareness, and Agent mode for autonomous multi-file edits with terminal execution. Supports GPT-4, Claude, and more with automatic context from project files and docs. Includes privacy mode for SOC 2 compliance. The leading AI-native IDE with 100K+ paying users.

freemium, Telemetry

OpenCode

Top Pick

Open-source AI coding agent for the terminal

Open-source terminal-based AI coding agent built in Go by the SST team, with a rich TUI (Bubble Tea) supporting 75+ model providers including OpenAI, Anthropic, Gemini, Bedrock, Groq, and OpenRouter. Features vim-like editing, persistent SQLite sessions, and LSP integration for 40+ languages. Fully free with no vendor lock-in, it has rapidly grown to 95k+ GitHub stars.

open-source

Codex

Top Pick

OpenAI coding agent for app, editor, terminal, and cloud work

Codex is OpenAI's coding agent for software development across the Codex app, editor, terminal, and cloud tasks. It helps write, review, debug, refactor, and automate code, with ChatGPT plan access for managed surfaces and API-key usage for CLI, SDK, and IDE workflows. The open-source CLI and SDK support local repository work, while cloud features add GitHub review, Slack/Linear integrations, worktrees, skills, MCP, and automations.

freemium, Open Source

Accomplish Coworker

Open-source desktop AI coworker for browsing and code execution.

Accomplish Coworker is an MIT-licensed open-source AI coworker that runs on the desktop, combining computer-use style browsing with code execution so agents can research, implement, run, and debug workflows in one local environment.

open-source, Telemetry

Safari MCP Server

Apple's Safari-native MCP server for web debugging agents

Safari MCP Server is Apple's safaridriver-based MCP server in Safari Technology Preview, giving compatible coding agents local access to Safari page content, console logs, network requests, screenshots, JavaScript evaluation, interactions, viewport controls, and accessibility/performance checks.

free, Telemetry
