OpenSRE is an open-source toolkit from Tracer Cloud for building AI SRE agents that investigate and respond to production incidents. Rather than a generic chat-over-your-logs product, OpenSRE provides the scaffolding (connectors to common observability stacks, an incident workflow state machine, and an evaluation harness) so teams can assemble an agent that behaves like a junior on-call engineer: pulling metrics, correlating traces, reading recent deploys, and proposing a root-cause hypothesis.
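To make the idea of an incident workflow state machine concrete, here is a minimal sketch in plain Python. It is not OpenSRE's actual API: the stage names, the `ALLOWED` transition table, and the `IncidentWorkflow` class are all hypothetical, chosen only to show how an agent can be limited to a fixed set of legal steps.

```python
from enum import Enum, auto


class Stage(Enum):
    """Hypothetical investigation stages (not taken from OpenSRE)."""
    TRIGGERED = auto()
    GATHERING = auto()    # pulling metrics, traces, recent deploys
    HYPOTHESIS = auto()   # proposing a root cause
    RESOLVED = auto()


# Assumed transition table: which stages may follow which.
ALLOWED = {
    Stage.TRIGGERED: {Stage.GATHERING},
    Stage.GATHERING: {Stage.HYPOTHESIS},
    Stage.HYPOTHESIS: {Stage.GATHERING, Stage.RESOLVED},  # may loop back for more evidence
    Stage.RESOLVED: set(),
}


class IncidentWorkflow:
    """Tracks the current stage and rejects transitions not listed in ALLOWED."""

    def __init__(self) -> None:
        self.stage = Stage.TRIGGERED
        self.history = [Stage.TRIGGERED]

    def advance(self, target: Stage) -> None:
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"illegal transition {self.stage.name} -> {target.name}")
        self.stage = target
        self.history.append(target)


workflow = IncidentWorkflow()
workflow.advance(Stage.GATHERING)
workflow.advance(Stage.HYPOTHESIS)
workflow.advance(Stage.RESOLVED)
```

Keeping the legal transitions in a single table makes the agent's behavior auditable: any step it takes outside the table fails loudly instead of silently skipping evidence gathering.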
Out of the box, the framework integrates with the usual SRE surface: Prometheus, Grafana, Datadog-style metric queries, log backends, Kubernetes, and incident platforms. A notable design choice is the simulation and benchmarking layer: teams can replay past incidents against the agent to measure how well it diagnoses real outages before letting it touch a live pager. That measured track record makes an OpenSRE agent easier to trust in production than a from-scratch LangChain pipeline that ships without a comparable evaluation harness.
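The replay idea can be sketched in a few lines of Python. Everything here is an assumption for illustration rather than OpenSRE's real benchmarking interface: the `RecordedIncident` schema, the `Agent` callable type, the `replay` function, and the toy `keyword_agent` are hypothetical, and the three incidents are made-up sample data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RecordedIncident:
    """A past incident with a known root cause (hypothetical schema)."""
    incident_id: str
    signals: Dict[str, str]   # e.g. alert text, recent change notes
    true_root_cause: str


# Hypothetical agent interface: recorded signals in, root-cause label out.
Agent = Callable[[Dict[str, str]], str]


def replay(agent: Agent, incidents: List[RecordedIncident]) -> float:
    """Return the fraction of incidents where the agent names the recorded root cause."""
    if not incidents:
        return 0.0
    hits = sum(1 for inc in incidents if agent(inc.signals) == inc.true_root_cause)
    return hits / len(incidents)


def keyword_agent(signals: Dict[str, str]) -> str:
    """Toy stand-in for a real agent: picks a label from keywords in the signals."""
    text = " ".join(signals.values()).lower()
    if "oom" in text:
        return "memory_leak"
    if "deploy" in text:
        return "bad_deploy"
    return "unknown"


# Made-up sample incidents, for illustration only.
INCIDENTS = [
    RecordedIncident("inc-1", {"alert": "pod OOMKilled"}, "memory_leak"),
    RecordedIncident("inc-2", {"alert": "5xx spike", "change": "deploy v42 rolled out"}, "bad_deploy"),
    RecordedIncident("inc-3", {"alert": "disk latency high"}, "disk_saturation"),
]

accuracy = replay(keyword_agent, INCIDENTS)
print(f"diagnostic accuracy: {accuracy:.2f}")
```

Exact-match accuracy is the simplest possible score. A real harness would likely also track partial credit, time to diagnosis, and which tools the agent called, but the shape is the same: recorded inputs, a known answer, and a number a team can watch before going live.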
The project is Apache-2.0 licensed and written in Python, which fits the typical DevOps toolchain and makes custom connectors straightforward to add. It is a strong fit for platform and SRE teams who want an agentic incident workflow they can self-host, extend, and evaluate — without buying into a closed AIOps vendor stack.
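As a rough picture of what a custom connector could look like in Python, the sketch below uses a structural `Protocol`. The `MetricConnector` interface, the `StaticConnector` class, and the `latest_value` helper are hypothetical names invented for this example, and OpenSRE's real connector interface may look quite different.

```python
from typing import Dict, List, Optional, Protocol


class MetricConnector(Protocol):
    """Hypothetical connector interface: run a query, get back a series of samples."""

    def query(self, expression: str) -> List[float]:
        ...


class StaticConnector:
    """In-memory connector for tests; a real one would call a metrics backend."""

    def __init__(self, series: Dict[str, List[float]]) -> None:
        self._series = dict(series)

    def query(self, expression: str) -> List[float]:
        # Unknown expressions return an empty series rather than raising.
        return list(self._series.get(expression, []))


def latest_value(connector: MetricConnector, expression: str) -> Optional[float]:
    """Return the most recent sample for an expression, or None if there is no data."""
    values = connector.query(expression)
    return values[-1] if values else None


connector = StaticConnector({"http_error_rate": [0.01, 0.02, 0.35]})
print(latest_value(connector, "http_error_rate"))
```

Because `Protocol` matches on structure, any class with a compatible `query` method satisfies the interface without inheriting from it, which keeps third-party connectors decoupled from the core package.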
