Promptfoo is an open-source tool for testing, evaluating, and red-teaming large language model (LLM) applications. It solves the challenge of systematically measuring and comparing LLM output quality by providing a framework for defining test cases, running prompts against multiple models and configurations, and evaluating results using automated scoring methods. Promptfoo enables developers to treat prompt engineering as a rigorous, data-driven discipline rather than a trial-and-error process, bringing software testing best practices to the emerging field of AI application development.
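Test suites are defined declaratively in a config file (conventionally `promptfooconfig.yaml`). The sketch below follows promptfoo's documented format but is illustrative only: the model IDs, prompt, and assertion values are placeholders, not a prescribed setup.

```yaml
# promptfooconfig.yaml -- illustrative sketch; model IDs and values are placeholders
prompts:
  - "Translate the following text to French: {{text}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "Hello, world"
    assert:
      - type: contains        # deterministic string check
        value: "Bonjour"
      - type: llm-rubric      # model-graded assertion
        value: "Is a natural, fluent French translation"
```

Running `promptfoo eval` then executes every prompt/provider/test combination and scores the outputs against the assertions, and `promptfoo view` opens the web UI to compare results side by side.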
Promptfoo supports testing against any LLM provider, including OpenAI, Anthropic, Google, Mistral, and locally hosted models, with side-by-side comparison of outputs across different models and prompt variations. It provides built-in evaluation metrics (factuality, relevance, toxicity) alongside custom assertions, a web UI for reviewing and comparing results, and CI/CD integration for automated regression testing of prompts. Promptfoo also includes red-teaming capabilities for identifying jailbreaks, hallucinations, and safety vulnerabilities in LLM applications, with configurable attack strategies and risk categories.
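Red-teaming is configured in the same file via a `redteam` block that selects risk-category plugins and attack strategies. The following is a sketch based on promptfoo's documented plugin/strategy scheme; the specific plugin and strategy names shown are examples and should be checked against the current documentation.

```yaml
# Illustrative red-team configuration; plugin and strategy names are examples
targets:
  - openai:gpt-4o-mini

redteam:
  purpose: "Customer support assistant for a retail bank"
  numTests: 5           # adversarial probes generated per plugin
  plugins:
    - harmful:hate      # harmful-content risk category
    - pii               # personally identifiable information leakage
    - hallucination     # fabricated or unsupported claims
  strategies:
    - jailbreak         # iterative jailbreak attempts
    - prompt-injection  # injected-instruction attacks
```

With this in place, `promptfoo redteam run` generates attack cases from the selected plugins, applies the strategies, and reports findings grouped by risk category.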
Promptfoo is designed for AI engineers, ML engineers, and developers building LLM-powered applications who need to ensure consistent output quality, compare model performance, and detect regressions when prompts or models change. It integrates with popular LLM frameworks and APIs, supports RAG pipeline evaluation, and can be embedded into CI/CD workflows for automated quality gates. Promptfoo is particularly valuable for teams deploying LLM applications in production who need confidence that changes to prompts, models, or retrieval systems will not degrade the user experience.
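As a quality gate, the eval can run as an ordinary pipeline step: a failing assertion causes `promptfoo eval` to exit with a nonzero status, which fails the build. A hypothetical GitHub Actions job might look like the following; the workflow name, secret name, and file paths are placeholders.

```yaml
# .github/workflows/prompt-eval.yml -- hypothetical workflow; names are placeholders
name: Prompt regression tests
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # A failing assertion makes the eval exit nonzero, failing this job
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

The JSON output file can additionally be archived as a build artifact so reviewers can inspect per-test scores for a failed run.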