What AgentOps Does
AgentOps is observability for AI agents the way Datadog is observability for backends. Once your agent moves from a demo into production, every failure becomes a forensic problem: which model call returned the wrong tool name, which tool execution raised an exception, where the retry loop started. AgentOps instruments the agent at the SDK level and records every LLM call, tool invocation, error, and decision into a replayable session trace, so the post-mortem starts with evidence instead of speculation.
Session Replay and Multi-Agent Tracing
Session replay is the headline feature. Every agent run becomes a single playable trace: model calls in order, tool inputs and outputs, latencies, costs, and any exceptions, all visible in one timeline rather than scattered across logs. For multi-agent workflows (CrewAI crews, AutoGen group chats, LangGraph state machines), the trace stitches the agents together and shows which agent invoked which, with attribution per step. This is the part that makes AgentOps qualitatively different from generic LLM logging: you can watch an agent fail rather than reconstruct the failure from text.
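One way to picture how a multi-agent trace is stitched together is a flat list of steps, each naming the agent that ran it and the step that triggered it. The sketch below is purely illustrative: the Step class, its field names, and the agent names are hypothetical and are not the AgentOps data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One recorded step in a hypothetical session trace."""
    step_id: int
    agent: str                        # agent that executed this step
    kind: str                         # "llm" or "tool"
    parent_id: Optional[int] = None   # step that triggered this one, if any


def invocation_edges(steps):
    """Return (caller_agent, callee_agent) pairs in timeline order."""
    by_id = {s.step_id: s for s in steps}
    edges = []
    for s in steps:
        parent = by_id.get(s.parent_id)
        # Only a hand-off between two different agents counts as an invocation.
        if parent is not None and parent.agent != s.agent:
            edges.append((parent.agent, s.agent))
    return edges


steps = [
    Step(1, "planner", "llm"),
    Step(2, "researcher", "llm", parent_id=1),
    Step(3, "researcher", "tool", parent_id=2),
    Step(4, "writer", "llm", parent_id=1),
]
print(invocation_edges(steps))
```

Keeping a parent pointer on every step is what lets a viewer answer "which agent invoked which" without parsing log text.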
Tool-call attribution matters more than it sounds. When an agent uses five tools and the final answer is wrong, the question is usually not which model produced the answer but which tool returned the bad data that the model trusted. AgentOps surfaces the inputs and outputs of every tool call inline with the LLM messages, so the chain of evidence is intact. Sessions can be tagged, filtered, and shared with a link, which is useful when a debugging conversation needs to leave Slack and enter a ticket.
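The value of keeping tool outputs inline with the LLM messages can be shown with a toy trace. This is a minimal sketch that assumes a session is a list of dicts; the event shape, the get_price tool, and the sanity check are all invented for illustration and do not reflect the AgentOps export format.

```python
def first_suspect_tool_call(events, looks_bad):
    """Return the earliest tool event whose output fails the check, or None."""
    for event in events:
        if event["type"] == "tool" and looks_bad(event["output"]):
            return event
    return None


# Hypothetical session: the model calls a tool, then trusts its output.
session = [
    {"type": "llm", "output": "call get_price(sku='A1')"},
    {"type": "tool", "name": "get_price", "input": {"sku": "A1"},
     "output": {"price": -4.0}},
    {"type": "llm", "output": "The price is -4.00 USD."},
]

# Hypothetical sanity check: a price should never be negative.
suspect = first_suspect_tool_call(session, lambda out: out.get("price", 0) < 0)
print(suspect["name"], suspect["input"])
```

Because the tool event sits between the two model messages, the search lands on the tool that supplied the bad value rather than on the model that repeated it.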
Framework Integrations and Setup
Setup is the quietly compelling part. Two lines of Python, import agentops followed by agentops.init(api_key), are enough: the SDK then auto-instruments OpenAI, Anthropic, LangChain, LlamaIndex, CrewAI, and AutoGen calls without any further wiring. Sessions and events appear in the dashboard within seconds. For frameworks AgentOps does not auto-instrument, the manual decorator API takes a few more lines and gives explicit control over what gets recorded.
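To make the idea of decorator-based manual instrumentation concrete, here is a stand-in written with only the standard library. It is not the AgentOps decorator API: record_step, the RECORDED list, and lookup_order are hypothetical names, and a real integration would send each entry to the tracing backend instead of a Python list.

```python
import functools
import time

RECORDED = []  # hypothetical in-memory stand-in for a tracing backend


def record_step(name):
    """Decorator that records inputs, output, error, and latency of a call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = {"name": name, "args": args, "kwargs": kwargs,
                     "output": None, "error": None}
            start = time.perf_counter()
            try:
                entry["output"] = func(*args, **kwargs)
                return entry["output"]
            except Exception as exc:
                entry["error"] = repr(exc)
                raise  # re-raise so the caller still sees the failure
            finally:
                # Runs on success and on failure, so every call is recorded.
                entry["latency_s"] = time.perf_counter() - start
                RECORDED.append(entry)
        return wrapper
    return decorator


@record_step("lookup_order")
def lookup_order(order_id):
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"order_id": order_id, "status": "shipped"}


lookup_order(42)
try:
    lookup_order(-1)
except ValueError:
    pass

print([(e["name"], e["error"]) for e in RECORDED])
```

The explicit decorator shows the trade-off described above: a few more lines per function, in exchange for deciding exactly which calls are recorded.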
What gets captured by default is generous: model name and version, prompt and completion, token counts, cost, latency, tool name, tool input, tool output, and any raised exception. PII redaction is opt-in rather than default, which is a real trade-off for teams handling user data: instrumenting an agent that talks to customers means traces will contain customer text unless you wire up masking before calling agentops.init. The SDK supports redaction hooks, but implementing them is your responsibility.
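Since masking is left to the integrator, a team might start with something like the sketch below and run it over text before that text reaches any tracing call. The mask_pii function and its two regular expressions are illustrative assumptions: they catch obvious email addresses and phone-like digit runs, not PII in general. How such a function plugs into the SDK is not shown, because the hook interface is not described here.

```python
import re

# Deliberately simple patterns; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_pii("Reach me at jane.doe@example.com or +1 (555) 010-9999."))
```

Applying the mask before text is recorded, rather than at display time, means the raw customer text never leaves the process.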
Cost Tracking and Production Budgets
Per-run cost breakdown is one of the underrated features. AgentOps surfaces dollar cost per session, per agent, and per model, broken down by prompt vs completion tokens. For teams running agents in production, this turns a vague monthly OpenAI bill into a per-workflow line item: which agent costs the most, which user query is the most expensive, where the budget is going. Cost alerts can be configured to fire when a session crosses a threshold.
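The arithmetic behind a per-agent cost figure and a threshold alert is simple enough to sketch. Everything specific below is a placeholder: the model names, the per-1,000-token prices, and the threshold are made up, and are not real provider rates or AgentOps defaults.

```python
# Hypothetical prices in USD per 1,000 tokens: (prompt, completion).
PRICES = {
    "model-small": (0.0005, 0.0015),
    "model-large": (0.01, 0.03),
}


def call_cost(model, prompt_tokens, completion_tokens):
    """Cost of one LLM call, pricing prompt and completion tokens separately."""
    prompt_price, completion_price = PRICES[model]
    return (prompt_tokens * prompt_price
            + completion_tokens * completion_price) / 1000


def session_cost_by_agent(calls):
    """Sum call costs per agent; each call is (agent, model, prompt, completion)."""
    totals = {}
    for agent, model, prompt_tokens, completion_tokens in calls:
        cost = call_cost(model, prompt_tokens, completion_tokens)
        totals[agent] = totals.get(agent, 0.0) + cost
    return totals


calls = [
    ("planner", "model-large", 2000, 500),
    ("researcher", "model-small", 4000, 1000),
    ("researcher", "model-small", 6000, 1000),
]
totals = session_cost_by_agent(calls)
session_total = sum(totals.values())

THRESHOLD_USD = 0.04  # hypothetical per-session budget
over_budget = session_total > THRESHOLD_USD
print(totals, round(session_total, 4), over_budget)
```

Grouping by agent rather than by model is what turns the total into a per-workflow line item: here the single planner call on the larger model outweighs both researcher calls combined.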