Promptfoo

LLM testing and evaluation toolkit

Open Source

Promptfoo is an OpenAI-owned open-source toolkit for evaluating, red-teaming and securing LLM applications. It supports config-driven prompt/model tests, CI regression gates, red-team scans, guardrails, model security workflows, MCP Proxy, code scanning and evaluations across prompts, agents and RAG pipelines.

A detailed review by the aicoolies team

Promptfoo is an open-source evaluation and AI-security toolkit for LLM applications, agents and RAG systems. It lets teams define prompts, providers, test cases and assertions in configuration, then run repeatable evaluations locally, in CI or through a web review workflow instead of relying on manual prompt checks.
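The idea of declaring prompts, providers, test cases and assertions up front can be sketched in a few lines of Python. This is an illustration of the pattern only, not Promptfoo's configuration format or API: `EvalCase`, `echo_provider` and `run_eval` are hypothetical names, and the provider is a fake that runs offline instead of calling a model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a model provider: rendered prompt in, text out.
Provider = Callable[[str], str]


@dataclass
class EvalCase:
    """One test case: template variables plus simple 'contains' assertions."""
    vars: dict[str, str]
    must_contain: list[str] = field(default_factory=list)


def echo_provider(prompt: str) -> str:
    """Fake provider so the sketch runs offline; a real one would call an LLM."""
    return f"Answer: {prompt.upper()}"


def run_eval(template: str, provider: Provider, cases: list[EvalCase]) -> list[bool]:
    """Render the prompt for each case, call the provider, check every assertion."""
    results = []
    for case in cases:
        prompt = template.format(**case.vars)
        output = provider(prompt)
        results.append(all(needle in output for needle in case.must_contain))
    return results


CASES = [
    EvalCase(vars={"topic": "refunds"}, must_contain=["REFUNDS"]),
    EvalCase(vars={"topic": "shipping"}, must_contain=["SHIPPING", "Answer:"]),
    EvalCase(vars={"topic": "billing"}, must_contain=["invoice"]),  # fails on purpose
]

results = run_eval("Summarize our policy on {topic}.", echo_provider, CASES)
print(results)
```

Because the cases live in data rather than in someone's head, the same run can be repeated locally or in CI and compared across prompt or model changes.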

The current official positioning is broader than prompt regression testing. Promptfoo now says it is part of OpenAI and highlights Red Teaming, Guardrails, Model Security, MCP Proxy, Code Scanning and Evaluations. That makes it relevant for security teams reviewing jailbreaks, unsafe tool use, prompt injection, model-risk gaps and MCP-mediated agent workflows.

Promptfoo works best as the evaluation and AI-security layer of an LLMOps stack. It can gate prompt and model changes before deployment, compare providers, and run adversarial tests, but teams may still need separate observability, tracing, production feedback and incident-response systems for live operations.
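A deployment gate of the kind described above usually reduces to a pass-rate threshold applied per provider. The sketch below assumes evaluation results are already available as lists of booleans; the provider names, the 0.9 threshold and the `gate` helper are illustrative assumptions, not Promptfoo features.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; an empty run counts as 0.0 so the gate fails closed."""
    return sum(results) / len(results) if results else 0.0


def gate(results_by_provider: dict[str, list[bool]], threshold: float) -> dict[str, bool]:
    """Return, per provider, whether its pass rate meets the threshold."""
    return {
        name: pass_rate(results) >= threshold
        for name, results in results_by_provider.items()
    }


# Hypothetical evaluation results for two providers on the same four test cases.
RESULTS = {
    "provider-a": [True, True, True, False],  # 3 of 4 pass
    "provider-b": [True, True, True, True],   # 4 of 4 pass
}

decisions = gate(RESULTS, threshold=0.9)
print(decisions)

# A CI job would typically pass this value to sys.exit() so a failing gate blocks the merge.
exit_code = 0 if all(decisions.values()) else 1
print(exit_code)
```

Treating an empty result set as a failure is a deliberate choice here: a gate that passes when no tests ran would hide a broken evaluation step.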

Pricing

Free open-source core, plus enterprise and security platform offerings now positioned as part of OpenAI

Platforms

CLI, Node.js, Web UI, CI/CD, red-team/security workflows and MCP Proxy

Use Cases

Testing & QA Automation, Vibe Coding

Alternatives

DSPy

Programming — not prompting — LLMs

Declarative framework from Stanford University for programming language models rather than prompting them. DSPy treats LLM interactions as programmable modules with input-output signatures and uses optimization algorithms to automatically compile these modules into effective prompts or fine-tuned weights, replacing brittle prompt strings with structured, modular AI software.

Open Source

BAML

Type-safe LLM function builder

BAML is a domain-specific language by BoundaryML for building reliable AI workflows and agents through schema engineering. Developers declaratively define function schemas, validate LLM responses, and version prompts without fragile JSON parsing or boilerplate. This turns prompt engineering into a structured, type-safe discipline and keeps AI workflows testable and maintainable across models.

Open Source

Instructor

Structured LLM outputs with validation

Instructor is the most popular Python library for extracting structured, validated data from large language models, with over 3 million monthly downloads and ports to TypeScript, Go, Ruby, Elixir, and Rust. It uses Pydantic models to define output schemas and automatically handles validation, retries, and error correction when the LLM output does not match the schema. Instructor patches existing client libraries instead of replacing them, preserving full access to the underlying API.

Open Source

Agenta

Open-source LLMOps platform for prompt management and evaluation

Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ LLMs with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.

Open Source

Related Tools

Hermes Agent

Top Pick

Open-source AI agent framework with persistent memory, reusable skills, tools, and messaging gateways

Hermes Agent is an open-source AI agent framework with persistent memory, reusable skills, 40+ tools, cron jobs, and messaging gateways.

Open Source

Safari MCP Server

Apple's Safari-native MCP server for web debugging agents

Safari MCP Server is Apple's safaridriver-based MCP server in Safari Technology Preview, giving compatible coding agents local access to Safari page content, console logs, network requests, screenshots, JavaScript evaluation, interactions, viewport controls, and accessibility/performance checks.

Free, Telemetry

BeeAI Framework

Python and TypeScript framework for production multi-agent systems

BeeAI Framework is an Apache-2.0 toolkit for building production-ready AI agents and multi-agent systems in Python and TypeScript. Its docs cover agents, tools, RAG, memory, workflows, backend providers, serving, and A2A/MCP integration surfaces, making it a vendor-neutral option for teams comparing LangGraph, CrewAI, Mastra, and related agent runtimes.

Open Source, Telemetry

Superserve

Open-source Firecracker sandboxes for long-running AI agents

Superserve is an open-source sandbox infrastructure layer for AI agents that need durable computers instead of short-lived shells. It runs isolated Firecracker microVMs, supports pause, resume, snapshot, fork, preview URLs, MCP connectivity, SDK/API control, Docker workloads, and self-hosting, while the hosted service adds pay-as-you-go agent sandboxes for teams.

Open Source

Anthropic Agent Skills

Official Claude Agent Skills examples, spec, and plugin marketplace for reusable agent capabilities

Anthropic Agent Skills is Anthropic's official reference repo and Claude Code plugin marketplace for reusable Skill folders. It packages example SKILL.md workflows, document skills, a Claude API skill, templates, and the Agent Skills spec so teams can turn repeatable instructions, scripts, and resources into on-demand Claude capabilities instead of copying prompts across sessions.

Free, Telemetry

agmsg

Cross-agent messaging for CLI coding agents

agmsg is an MIT-licensed Bash and SQLite messaging layer for CLI coding agents. It lets Claude Code, Codex, Gemini CLI, GitHub Copilot CLI, Antigravity, OpenCode, Hermes, and other terminal agents exchange messages through a shared local database instead of relying on a human copy-paste relay. It is intentionally not MCP, not a broker, and not a subagent framework.

Open Source

Used in Stacks

Production LLM Evaluation Stack

A production LLM evaluation stack should catch regressions before release, probe security failures, and close the loop with real traces and user feedback. This stack combines Promptfoo for CI gates, DeepEval/OpenAI Evals for metric-heavy test suites, and Langfuse or Helicone for observability and production datasets.

varies

AI Agent Red-Teaming and Evaluation Stack

Stress-test LLM applications against OWASP threats with security scanning, evaluation frameworks, and content safety models.

$0/mo

Comparisons

Giskard vs Promptfoo — AI Security Scans or CI Prompt Red Teaming

Giskard and Promptfoo both improve LLM quality and safety, but they enter the workflow from different sides. Giskard is stronger for automated AI risk scanning, while Promptfoo is stronger for developer-owned prompt regression and red-team testing.

Giskard, Promptfoo

OpenAI Evals vs Promptfoo — Benchmark Harness or Prompt Regression Matrix

OpenAI Evals and Promptfoo both help teams evaluate model behavior, but they serve different operating rhythms. OpenAI Evals is closer to a benchmark and eval registry, while Promptfoo is built for practical prompt, model, and red-team regression testing in development workflows.

OpenAI Evals, Promptfoo

DeepEval vs Promptfoo — Pytest-Style LLM Testing vs CLI-First Evaluation Framework

DeepEval and Promptfoo are two of the most popular open-source LLM evaluation frameworks, but they target different developer workflows. DeepEval integrates with pytest for unit-test-style LLM evaluations and ships 50+ built-in metrics. Promptfoo takes a CLI-first approach, using YAML configuration for prompt comparison and red-teaming. This comparison helps ML engineers choose an evaluation foundation for LLM quality assurance.

DeepEval, Promptfoo

RAGAS vs DeepEval vs Promptfoo — LLM Evaluation Framework Comparison

Three open-source frameworks for evaluating LLM application quality. RAGAS specializes in RAG pipeline metrics, DeepEval brings pytest-style unit testing to LLM outputs, and Promptfoo provides a CLI-first approach to prompt testing with red-teaming capabilities.

RAGAS, DeepEval, Promptfoo
