
Agenta

Open-source LLMOps platform for prompt management and evaluation

Open Source

Agenta is an open-source LLMOps platform that combines prompt engineering playgrounds, prompt version management, LLM evaluation, and observability in a unified interface. It supports 50+ LLMs with side-by-side prompt comparison, A/B testing, human evaluation workflows, and OpenTelemetry-native tracing. It is self-hostable and has 4,000+ GitHub stars.

Agenta unifies the scattered workflow of prompt engineering into a single platform where teams can experiment, evaluate, deploy, and monitor LLM-powered features. The prompt playground provides side-by-side comparison of outputs across different models, prompt variants, and parameter configurations, making it easy to identify which combination produces the best results for specific use cases. Prompt versions are tracked with full history, enabling teams to roll back to previous configurations when new prompts underperform and maintain an audit trail of all changes to production prompts.

The evaluation system supports multiple assessment methodologies including automated LLM-as-judge scoring, custom Python evaluation functions, A/B testing with statistical significance calculation, and human evaluation workflows where domain experts rate outputs against defined criteria. Evaluation results are tied to specific prompt versions, creating a data-driven development cycle where every prompt change is validated against objective metrics before deployment. The platform's OpenTelemetry-native observability layer provides distributed tracing for production LLM calls, enabling teams to monitor latency, token usage, error rates, and custom quality metrics across their deployed prompt configurations.
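To make the evaluation ideas above concrete, here is a minimal sketch of a custom Python evaluation function and an A/B significance check. It is not Agenta's SDK or API: the names `exact_match_evaluator`, `compare_variants`, and `ABResult` are hypothetical, and the two-proportion z-test is an assumed choice of significance test, since the platform's actual method is not described here.

```python
import math
from dataclasses import dataclass


def exact_match_evaluator(output: str, expected: str) -> float:
    """Hypothetical custom evaluator: 1.0 if the normalized output
    equals the expected answer, otherwise 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


@dataclass
class ABResult:
    rate_a: float   # pass rate of variant A
    rate_b: float   # pass rate of variant B
    z: float        # z statistic (positive when B beats A)
    p_value: float  # two-sided p-value


def compare_variants(scores_a: list[float], scores_b: list[float]) -> ABResult:
    """Two-sided two-proportion z-test on pass/fail scores (1.0 = pass).

    Assumed here for illustration; other tests may suit other metrics.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    pass_a, pass_b = sum(scores_a), sum(scores_b)
    rate_a, rate_b = pass_a / n_a, pass_b / n_b

    # Pooled pass rate under the null hypothesis that both variants are equal.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All samples passed or all failed: no evidence of a difference.
        return ABResult(rate_a, rate_b, 0.0, 1.0)

    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return ABResult(rate_a, rate_b, z, p_value)
```

In use, each prompt variant's outputs would be scored with an evaluator such as `exact_match_evaluator`, and the two score lists passed to `compare_variants`. A small `p_value` suggests the difference in pass rates is unlikely to be due to chance alone.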

Agenta differentiates itself from more focused tools like Langfuse (primarily observability) and Promptfoo (primarily evaluation) by integrating the entire prompt engineering lifecycle in one interface. The platform supports over 50 LLMs through direct API integrations, works with any Python-based LLM application through a lightweight SDK, and can be self-hosted via Docker for organizations with data residency requirements. With 4,000+ GitHub stars and active development, Agenta serves teams that want a comprehensive prompt operations platform without stitching together multiple specialized tools.

Pricing

Free self-hosted (Apache-2.0); Agenta Cloud freemium

Platforms

Docker self-hosted or Agenta Cloud SaaS

Related Tools

Safari MCP Server

Apple's Safari-native MCP server for web debugging agents

Safari MCP Server is Apple's safaridriver-based MCP server in Safari Technology Preview, giving compatible coding agents local access to Safari page content, console logs, network requests, screenshots, JavaScript evaluation, interactions, viewport controls, and accessibility/performance checks.

Free, Telemetry

Latitude

Sentry-style observability for AI agent conversations

Latitude is an agent observability platform for teams that need to inspect LLM traces, conversations, issues, and evaluation feedback in one workflow. Its public repo and docs position it as a Sentry-style monitor for AI agents, with semantic search, issue detection, annotations, MCP-assisted fixes, and cloud or self-hosted deployment paths for production debugging.

Freemium, Open Source, Telemetry

Spotlight by Backplanes

Session reports for Claude Code and Codex runs

Spotlight by Backplanes turns completed Claude Code and Codex sessions into concise reports for engineering, security, and spend review. The CLI installs on macOS, Linux, or WSL 2, watches sessions after they finish, redacts PII and credentials locally before upload, then summarizes files touched, commands run, external domains reached, scope drift, risky actions, and next-session improvements.

Freemium, Telemetry

Rampart

Microsoft’s pytest-native red teaming framework for turning AI agent safety findings into CI tests.

RAMPART is an open-source Microsoft framework for safety and security testing of agentic AI applications. It brings red-team findings into a pytest-native workflow so teams can turn prompt injection, unsafe tool use, and behavioral boundary failures into repeatable regression tests. Its main appeal is developer workflow: RAMPART makes agent safety part of CI/CD instead of a one-off security review.

Open Source

Traceway

OpenTelemetry-native observability with AI tracing, logs, traces, metrics, and session replay — self-hosted in 90 seconds.

Traceway is an open-source, OpenTelemetry-native observability platform that combines logs, traces, metrics, exceptions, session replay, and AI tracing in a single self-hosted system. MIT licensed with no open-core restrictions, it deploys in 90 seconds via Docker Compose and accepts OTLP/HTTP from any OTel SDK without a Collector or per-language vendor SDK.

Open Source

Judgeval

Open-source post-building layer for agents — tracing, evals, and online monitoring

Judgeval is the open-source post-building layer for AI agents from Judgment Labs, providing OpenTelemetry-based tracing, hosted and custom evaluation scorers, and online behavior monitoring for LLM-powered applications. Instrument any function with a single decorator, score live production traffic against faithfulness and instruction-adherence checks, and feed real-world failures back into reinforcement learning or supervised fine-tuning loops.

Open Source