W&B Weave extends the Weights & Biases platform into LLM application observability. By adding the @weave.op decorator to Python functions, developers get automatic tracing of LLM calls, tool invocations, and agent steps, complete with input/output logging, token counts, latency, and cost per call. The trace explorer visualizes complex multi-step agent workflows as navigable trees, making it straightforward to pinpoint where failures or quality issues occur in production applications.
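The decorator pattern described above can be sketched in plain Python. This is a hypothetical local stand-in for illustration, not the weave SDK itself (the names `TRACES`, `traced`, and `summarize` are invented here); the real `@weave.op` additionally records token counts and costs and streams traces to the W&B backend.

```python
import functools
import time

# Local stand-in for a trace store; the real SDK sends traces to W&B.
TRACES = []

def traced(fn):
    """Sketch of a tracing decorator: capture inputs, output, and latency."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACES.append({
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@traced
def summarize(text: str) -> str:
    # Placeholder for an LLM call; a real app would hit a model API here.
    return text[:20] + "..."

summarize("Weave traces every decorated function call.")
```

Because the decorator wraps nested calls too, decorating each step of an agent loop is what yields the navigable call tree in the trace explorer.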
The evaluation framework lets teams build systematic test suites for LLM applications using custom scorers and curated datasets. Evaluations can compare prompt variants, model versions, and configuration changes side-by-side with metrics tracked over time. Weave supports both automated scoring through LLM judges and human feedback collection, enabling teams to combine programmatic and qualitative evaluation. The playground feature provides a quick interface for testing prompts across different models before deploying changes.
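The dataset-plus-scorers shape of the evaluation framework can be illustrated with a minimal sketch. All names here (`exact_match`, `evaluate`, `app`) are hypothetical; Weave's actual Evaluation API follows the same structure but adds async execution, versioning, and result logging.

```python
def exact_match(expected: str, output: str) -> float:
    """A simple programmatic scorer; an LLM judge would return a score here instead."""
    return 1.0 if expected.strip().lower() == output.strip().lower() else 0.0

def evaluate(app, dataset, scorers):
    """Run each example through the app, score it, and aggregate mean scores."""
    rows = []
    for example in dataset:
        output = app(example["input"])
        rows.append({name: fn(example["expected"], output)
                     for name, fn in scorers.items()})
    return {name: sum(r[name] for r in rows) / len(rows) for name in scorers}

# Toy application under test (hypothetical): returns canned answers.
def app(question: str) -> str:
    return {"capital of France?": "Paris"}.get(question, "unknown")

dataset = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "capital of Mars?", "expected": "none"},
]

scores = evaluate(app, dataset, {"exact_match": exact_match})
# One correct answer out of two examples yields a mean exact_match of 0.5.
```

Running the same dataset and scorers against two prompt variants, and comparing the aggregated scores, is the side-by-side comparison the framework automates.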
Weave is part of the broader W&B ecosystem that includes experiment tracking, model registry, and data versioning. It provides Python and TypeScript SDKs with integrations for OpenAI, Anthropic, Google, LangChain, CrewAI, Amazon Bedrock, and other popular frameworks. The platform offers free, team, and enterprise tiers with self-hosted and cloud deployment options. For teams already using W&B for model training who are now building LLM applications, Weave provides a natural extension of their observability stack.