LLM evaluation and prompt engineering platform
Braintrust is an LLM evaluation platform for testing, scoring, and iterating on AI applications, built around dataset-centric regression testing. It features a prompt playground for rapid experimentation, automated evaluation with custom scorers and LLM judges, dataset management for building test suites from production data, and detailed tracing for debugging. It also supports A/B testing of prompts, comparison across model providers, and CI/CD integration for automated quality gates on LLM outputs.
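As a minimal sketch of the evaluation workflow, the snippet below uses the `Eval` entry point from the `braintrust` Python SDK with an inline dataset and a custom scorer function; the project name, dataset, task, and scorer are all illustrative placeholders.

```python
# pip install braintrust
from braintrust import Eval

# Illustrative custom scorer: exact match against the expected answer.
# Braintrust scorer functions receive the case's input/output/expected
# and return a score between 0 and 1.
def exact_match(input, output, expected):
    return 1.0 if output == expected else 0.0

Eval(
    "my-project",  # hypothetical project name
    # Dataset: a list of {input, expected} cases; in practice this could
    # be a Braintrust-managed dataset built from production logs.
    data=lambda: [
        {"input": "What is 2 + 2?", "expected": "4"},
    ],
    # Task: the function under test -- replace with your actual LLM call.
    task=lambda input: "4",
    scores=[exact_match],
)
```

Running this records an experiment in the Braintrust UI, where results can be compared against previous runs to catch regressions.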
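For LLM-as-judge scoring, Braintrust's companion `autoevals` library provides prebuilt judge scorers such as `Factuality`, which prompts a model to grade an output against the expected answer. A sketch of standalone usage (assuming an OpenAI API key is configured, the default judge backend):

```python
# pip install autoevals
from autoevals import Factuality

# LLM judge: asks a grading model whether the output is factually
# consistent with the expected answer.
result = Factuality()(
    input="Which country has the highest population?",
    output="China",
    expected="People's Republic of China",
)
print(result.score)  # a value between 0 and 1
```

The same scorer can be passed directly in an `Eval`'s `scores` list alongside custom functions.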