Midscene.js

AI-powered vision-driven UI automation for web, Android, and iOS

Open Source

Midscene.js is an open-source UI automation framework from ByteDance's Web Infra team that uses vision-based AI models to understand and interact with interfaces. It replaces fragile CSS selectors with natural language descriptions, supporting web browsers via Playwright and Puppeteer, Android via ADB, and iOS via WebDriverAgent from a unified JavaScript SDK.

Midscene.js replaces traditional DOM-selector automation with vision-based AI understanding. Instead of writing brittle selectors that target a specific CSS class, you describe the interaction in natural language, such as "click the login button", and the AI model visually locates the correct element and interacts with it. Because the description refers to what is visible on screen rather than to the underlying markup, this approach can survive UI redesigns that would break conventional selectors, which reduces test maintenance overhead.
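The contrast between selector-based and description-based lookup can be illustrated with a toy model. This is a minimal sketch and not Midscene.js code: `UiElement`, `findByClass`, and `findByDescription` are hypothetical names, and a plain substring match stands in for the vision model.

```typescript
// Toy model of a rendered screen: each element has a CSS class and a visible label.
// All names here are hypothetical and exist only for illustration.
interface UiElement {
  cssClass: string;
  label: string;
}

// Selector-style lookup: depends on an implementation detail (the class name).
function findByClass(screen: UiElement[], cssClass: string): UiElement | undefined {
  return screen.find((el) => el.cssClass === cssClass);
}

// Description-style lookup: depends only on what the user can see.
// A real vision model is far more capable; this substring match is a stand-in.
function findByDescription(screen: UiElement[], description: string): UiElement | undefined {
  const wanted = description.toLowerCase();
  return screen.find((el) => wanted.includes(el.label.toLowerCase()));
}

const before: UiElement[] = [
  { cssClass: "btn-primary-x7f2", label: "Login" },
  { cssClass: "link-muted", label: "Forgot password" },
];

// After a redesign the class names change, but the visible labels do not.
const after: UiElement[] = [
  { cssClass: "auth-submit", label: "Login" },
  { cssClass: "auth-help", label: "Forgot password" },
];

// The selector works before the redesign and finds nothing afterwards.
const selectorBefore = findByClass(before, "btn-primary-x7f2");
const selectorAfter = findByClass(after, "btn-primary-x7f2");

// The description still resolves to the login button after the redesign.
const describedAfter = findByDescription(after, "click the login button");
```

In this sketch the selector lookup fails once the class is renamed, while the description still matches the element a user would recognize as the login button.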

The framework supports multiple platforms from a single JavaScript SDK. Web automation works through Playwright or Puppeteer integration, Android automation connects via ADB, and iOS automation uses WebDriverAgent. A Chrome Extension provides immediate in-browser experimentation without any code setup. The MCP server integration exposes Midscene actions as tools for AI agents, enabling higher-level automation orchestration through natural language.

Midscene.js is built by ByteDance's Web Infra team and released under the MIT license, with over 8,000 GitHub stars. It supports multiple AI backends, including GPT-4o, Claude, Gemini, Qwen-VL, and ByteDance's open-source UI-TARS model for self-hosted deployments. The caching system records AI planning results so that repeated test runs can reuse them and execute at close to the speed of conventional scripted automation. Production users report testing costs as low as two dollars per day.
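The idea behind caching planning results can be sketched as a wrapper that stores the plan produced for each prompt. This is an illustrative sketch under stated assumptions, not Midscene.js's actual cache: `PlannedStep`, `withPlanCache`, and `fakeModelPlanner` are hypothetical, and the planner is synchronous here for brevity, whereas a real model call would be asynchronous.

```typescript
// Hypothetical shape of a planned action; not Midscene.js's real data model.
interface PlannedStep {
  action: "tap" | "type";
  target: string;
}

type Planner = (prompt: string) => PlannedStep[];

interface CachedPlanner {
  plan: Planner;
  stats: { hits: number; misses: number };
}

// Wraps a slow planner so that each distinct prompt is planned only once.
function withPlanCache(planner: Planner): CachedPlanner {
  const cache = new Map<string, PlannedStep[]>();
  const stats = { hits: 0, misses: 0 };

  const plan: Planner = (prompt) => {
    const cached = cache.get(prompt);
    if (cached !== undefined) {
      stats.hits += 1; // Reuse the recorded plan; no model call is made.
      return cached;
    }
    stats.misses += 1;
    const steps = planner(prompt); // The expensive step in a real system.
    cache.set(prompt, steps);
    return steps;
  };

  return { plan, stats };
}

// Stand-in for an AI model call; counts how often it is invoked.
let modelCalls = 0;
const fakeModelPlanner: Planner = (prompt) => {
  modelCalls += 1;
  return [{ action: "tap", target: prompt }];
};

const cachedPlanner = withPlanCache(fakeModelPlanner);
const firstRun = cachedPlanner.plan("click the login button");
const secondRun = cachedPlanner.plan("click the login button");
```

Here the second run is served from the cache, so the stand-in model is called only once. A production cache would also need a way to invalidate entries when the UI changes.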

Pricing

Free and open-source (MIT); AI model API costs vary by provider

Platforms

JavaScript/TypeScript, npm, Playwright/Puppeteer, Android, iOS

Alternatives

Browser Use

AI agent framework for web browser automation

Browser Use is an open-source AI agent framework with 99K+ GitHub stars that enables LLMs to control web browsers via natural language. Backed by Y Combinator, it lets agents navigate sites, fill forms, extract data, and complete multi-step tasks autonomously. It is built on Playwright and offers vision-based element detection, multi-tab management, cookie persistence, and self-correcting actions. It supports OpenAI, Anthropic, and local models through a simple Python API for building custom browser agents.

Open Source

Stagehand

AI-powered web browser automation with Playwright

Stagehand is an open-source browser-agent SDK from Browserbase that combines deterministic browser automation with AI primitives such as act(), extract(), observe(), and agent(). Instead of relying only on brittle selectors, developers can use natural-language actions, Zod-backed structured extraction, page observation, action caching, and Browserbase cloud-browser infrastructure for production web automation.

Open Source

Playwright

Reliable end-to-end testing

Cross-browser E2E testing framework by Microsoft supporting Chromium, Firefox, and WebKit with one API. Features auto-waiting, tracing with timeline/screenshots/DOM snapshots, codegen for recording tests, and parallel execution. Component testing for React, Vue, Svelte. Built-in API testing, network mocking, and mobile emulation. Known for reliability and speed vs Selenium/Cypress. 70K+ GitHub stars, rapidly becoming the E2E standard.

Open Source

Related Tools

Accomplish Coworker

Open-source desktop AI coworker for browsing and code execution.

Accomplish Coworker is an MIT-licensed open-source AI coworker that runs on the desktop, combining computer-use style browsing with code execution so agents can research, implement, run, and debug workflows in one local environment.

Open Source, Telemetry

Safari MCP Server

Apple's Safari-native MCP server for web debugging agents

Safari MCP Server is Apple's safaridriver-based MCP server in Safari Technology Preview, giving compatible coding agents local access to Safari page content, console logs, network requests, screenshots, JavaScript evaluation, interactions, viewport controls, and accessibility/performance checks.

Free, Telemetry

BrowserOS

Open-source agentic browser that runs local AI agents in your browsing workflow.

BrowserOS is a privacy-first, open-source agentic browser for running AI assistants locally inside real browsing sessions instead of handing every task to a remote cloud browser.

Open Source

Webwright

Microsoft browser agent that turns long-horizon web tasks into reusable Playwright code

Webwright is a Microsoft browser-agent project that asks coding models to write, debug, and reuse Playwright scripts instead of relying on one-off stochastic click loops. The approach gives automation teams a more inspectable artifact: scripts can be logged, reviewed, rerun, and maintained like normal test or scraping code. It is especially relevant for long-horizon browser tasks where teams care about determinism, auditability, and resilience to UI changes.

Open Source

Rampart

Microsoft’s pytest-native red teaming framework for turning AI agent safety findings into CI tests.

RAMPART is an open-source Microsoft framework for safety and security testing of agentic AI applications. It brings red-team findings into a pytest-native workflow so teams can turn prompt injection, unsafe tool use, and behavioral boundary failures into repeatable regression tests. Its main strength is developer workflow: RAMPART makes agent safety part of CI/CD instead of a one-off security review.

Open Source

Requestly

One tool for intercepting, mocking, and replaying HTTP — acquired by BrowserStack

Requestly is a BrowserStack-backed API client, HTTP interceptor, mock server, and session replay tool for frontend and QA teams. Its current product is a commercial offering centered on the API client, while the legacy interceptor code remains open source under AGPLv3. The free plan covers individual workflows, and the Pro plan, aimed at collaborative QA and frontend debugging teams, is listed at $12/user/month when billed monthly or $9/user/month when billed annually.

freemium
