AI Agent Red-Teaming and Evaluation Stack

Stress-test LLM applications against threats such as those in the OWASP Top 10 for LLM Applications, using security scanning, evaluation frameworks, and content safety models.

What This Stack Does

PurpleLlama provides Meta's Llama Guard models for content safety classification and CodeShield for insecure code detection. Agentic Radar scans agent workflows for MCP vulnerabilities. DeepEval offers pytest-native LLM testing with 50+ metrics. Promptfoo adds CLI-first evaluation with red-teaming attack generation. TruLens provides experiment tracking with RAG Triad metrics (context relevance, groundedness, and answer relevance). Langfuse delivers observability for production monitoring.
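To make the idea of pytest-style LLM testing concrete, here is a minimal sketch of the general pattern: build a test case, score it with a metric, and fail the test when the score falls below a threshold. It uses only the Python standard library. EvalCase, keyword_coverage, and the threshold value are hypothetical names chosen for illustration; they are not DeepEval's API, and the keyword metric is a toy stand-in for the model-graded metrics such frameworks ship.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One evaluation example (hypothetical type, not a DeepEval class)."""
    prompt: str
    actual_output: str


def keyword_coverage(case: EvalCase, required_keywords: list[str]) -> float:
    """Toy metric: fraction of required keywords found in the output."""
    if not required_keywords:
        return 1.0
    text = case.actual_output.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def test_refund_answer_mentions_policy_terms():
    # In a real suite, actual_output would come from the LLM app under test.
    case = EvalCase(
        prompt="What is your refund policy?",
        actual_output="You can request a refund within 30 days of purchase.",
    )
    score = keyword_coverage(case, ["refund", "30 days"])
    # Assumed threshold of 0.5; pick one that matches your quality bar.
    assert score >= 0.5, f"metric score {score:.2f} below threshold"
```

Because the function name starts with test_, pytest would collect it like any other unit test, so a check of this shape can run in CI next to regular tests.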

The Bottom Line

Deploy PurpleLlama for content safety, Agentic Radar for agent architecture auditing, DeepEval and Promptfoo for pre-deployment testing, TruLens for experiment tracking, and Langfuse for production observability. The stack covers the complete lifecycle from development testing to production monitoring.
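The tool-to-stage mapping described above can be written down as a small lookup table, which is useful for checking which lifecycle stages a given project still leaves uncovered. This is a sketch only: the Stage enum, the STACK dictionary, and the helper functions are illustrative names, and the mapping simply restates the recommendation in the paragraph above.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages named in the recommendation (illustrative enum)."""
    CONTENT_SAFETY = "content safety"
    ARCHITECTURE_AUDIT = "agent architecture auditing"
    PRE_DEPLOYMENT_TESTING = "pre-deployment testing"
    EXPERIMENT_TRACKING = "experiment tracking"
    PRODUCTION_OBSERVABILITY = "production observability"


# Stage -> tools, restating the mapping from the text.
STACK = {
    Stage.CONTENT_SAFETY: ["PurpleLlama"],
    Stage.ARCHITECTURE_AUDIT: ["Agentic Radar"],
    Stage.PRE_DEPLOYMENT_TESTING: ["DeepEval", "Promptfoo"],
    Stage.EXPERIMENT_TRACKING: ["TruLens"],
    Stage.PRODUCTION_OBSERVABILITY: ["Langfuse"],
}


def tools_for(stage: Stage) -> list[str]:
    """Return the tools suggested for a stage."""
    return STACK.get(stage, [])


def uncovered_stages(adopted: set[str]) -> list[Stage]:
    """List stages for which none of the suggested tools has been adopted."""
    return [
        stage
        for stage, tools in STACK.items()
        if not any(tool in adopted for tool in tools)
    ]


if __name__ == "__main__":
    for stage in uncovered_stages({"PurpleLlama", "DeepEval"}):
        print(f"No tool adopted yet for: {stage.value}")
```

A team that has adopted only PurpleLlama and DeepEval, for example, would see agent architecture auditing, experiment tracking, and production observability reported as gaps.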

Stack Overview

Tool	Role	Pricing	Open Source
PurpleLlama	Content Safety Models	Free and open-source (custom Meta license)	Yes
Agentic Radar	Agent Security Scanner	Free and open-source	Yes
DeepEval	LLM Unit Testing	Free and open-source / Confident AI cloud for dashboard	Yes
Promptfoo	Red-Teaming & Eval	Free and open-source / Enterprise available	Yes
TruLens	Experiment Tracking	Free and open-source (MIT)	Yes
Langfuse	Production Observability	Hobby free / Core from $29/mo / Pro from $199/mo	Yes