Stagehand bridges the gap between traditional browser automation (Playwright/Selenium) and AI-powered web interaction. Built by Browserbase, it adds an AI vision layer on top of Playwright that understands web pages contextually rather than relying on brittle selectors.
Three core primitives power the framework: act() performs an action described in natural language, extract() pulls structured data from a page based on a plain-English description, and observe() inspects the current page and reports what can be interacted with. Together they replace long CSS/XPath selector chains with short natural-language instructions.
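To make the division of labor among the three primitives concrete, here is a minimal TypeScript sketch. It does not use Stagehand itself: `ToyPage`, `PageElement`, and the keyword matching are hypothetical stand-ins for a real browser page and a real model, and the method signatures are simplified assumptions rather than the library's actual API.

```typescript
// Toy model of the three primitives. Everything here is a hypothetical
// stand-in: it is NOT Stagehand's API and involves no browser or LLM.

interface PageElement {
  role: "button" | "link" | "text";
  label: string; // visible text or accessible name
}

// Lowercase a string and split it into alphanumeric words.
function words(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

class ToyPage {
  readonly clicked: string[] = [];
  private readonly elements: PageElement[];

  constructor(elements: PageElement[]) {
    this.elements = elements;
  }

  // observe(): report the interactive elements currently on the page.
  observe(): PageElement[] {
    return this.elements.filter((e) => e.role !== "text");
  }

  // act(): pick the interactive element whose label shares the most words
  // with the instruction, then record a "click" on it.
  // (Assumption: keyword overlap stands in for the model's judgment.)
  act(instruction: string): string | null {
    const wanted = new Set(words(instruction));
    let best: PageElement | null = null;
    let bestScore = 0;
    for (const el of this.observe()) {
      const score = words(el.label).filter((w) => wanted.has(w)).length;
      if (score > bestScore) {
        best = el;
        bestScore = score;
      }
    }
    if (best === null) return null;
    this.clicked.push(best.label);
    return best.label;
  }

  // extract(): for each requested field, find a text element of the form
  // "<Field>: <value>" and return the value part (null if absent).
  extract(fields: string[]): Record<string, string | null> {
    const result: Record<string, string | null> = {};
    for (const field of fields) {
      const prefix = field.toLowerCase() + ":";
      const hit = this.elements.find(
        (e) => e.role === "text" && e.label.toLowerCase().startsWith(prefix),
      );
      result[field] = hit ? hit.label.slice(prefix.length).trim() : null;
    }
    return result;
  }
}

const page = new ToyPage([
  { role: "text", label: "Price: $19.99" },
  { role: "text", label: "Stock: 4 left" },
  { role: "button", label: "Add to cart" },
  { role: "link", label: "Sign in" },
]);

const available = page.observe().map((e) => e.label);
const acted = page.act("click the add to cart button");
const data = page.extract(["price", "stock"]);
console.log(available, acted, data);
```

The caller never names a selector: it states what it wants, and the page object decides which element satisfies the request. In the real framework that decision is delegated to a model rather than to word overlap.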
Under the hood, Stagehand takes screenshots, processes them through vision models to understand the page layout and the purpose of each element, and maps natural language instructions to specific DOM interactions via Playwright. Because each instruction is resolved against the page as it currently appears, automations tolerate UI changes, such as a renamed class or a relocated button, that would break traditional selector-based scripts.
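The pipeline described above (perceive the page, resolve the instruction to an element, hand a concrete locator to the driver) can be sketched as follows. This is an illustration under stated assumptions, not Stagehand's implementation: `Candidate`, `Resolver`, `keywordResolver`, and `resolveSelector` are hypothetical names, the two page snapshots are invented, and a simple keyword scorer stands in for the vision model.

```typescript
// Sketch of a perceive -> resolve -> act pipeline. All names are hypothetical
// and the "model" is a keyword scorer standing in for a real vision/LLM call.

interface Candidate {
  selector: string; // concrete locator a driver such as Playwright could execute
  description: string; // what the model perceives the element to be
}

// A resolver picks the candidate that best fits the instruction, or null.
// (Assumption: a real model provider would be plugged in behind this type.)
type Resolver = (instruction: string, candidates: Candidate[]) => Candidate | null;

function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

const keywordResolver: Resolver = (instruction, candidates) => {
  const wanted = new Set(tokenize(instruction));
  let best: Candidate | null = null;
  let bestScore = 0;
  for (const c of candidates) {
    const score = tokenize(c.description).filter((w) => wanted.has(w)).length;
    if (score > bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best;
};

// Map an instruction to the selector the driver should act on.
function resolveSelector(
  instruction: string,
  snapshot: Candidate[],
  resolve: Resolver,
): string | null {
  const chosen = resolve(instruction, snapshot);
  return chosen === null ? null : chosen.selector;
}

// The same page before and after a redesign (invented data).
const beforeRedesign: Candidate[] = [
  { selector: "#btn-login", description: "Log in button" },
  { selector: "#btn-signup", description: "Sign up button" },
];
const afterRedesign: Candidate[] = [
  { selector: "a.nav__auth--primary", description: "Log in link" },
  { selector: "button.cta", description: "Sign up for free button" },
];

const instruction = "click the log in button";
const selectorBefore = resolveSelector(instruction, beforeRedesign, keywordResolver);
const selectorAfter = resolveSelector(instruction, afterRedesign, keywordResolver);

// A script that hard-coded "#btn-login" finds nothing after the redesign.
const hardCodedStillWorks = afterRedesign.some((c) => c.selector === "#btn-login");

console.log({ selectorBefore, selectorAfter, hardCodedStillWorks });
```

The instruction stays constant while the selector it resolves to changes with the page, which is the property that selector-based scripts lack. Keeping the resolver behind a function type is one way a different model could be swapped in without touching the rest of the pipeline.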
The framework supports multiple LLM providers for the vision component and runs on any platform Playwright supports. It is particularly valuable for building AI agents that need to interact with websites that frequently change their UI or lack APIs.