UI-TARS Desktop is an open-source computer-use agent developed by ByteDance that takes a fundamentally different approach to GUI automation. Instead of relying on DOM inspection, accessibility trees, or element selectors, it uses a multimodal vision model called UI-TARS to understand screen content directly from screenshots. This vision-first approach means it can automate any application with a graphical interface — native desktop apps, web applications, mobile emulators, and remote desktop sessions — without requiring application-specific integration code.
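The contrast with selector-based automation can be sketched in a few lines: instead of querying a DOM or accessibility tree, the agent feeds raw pixels to a vision-language model and parses a grounded action (an action name plus screen coordinates) out of the model's text reply. The action grammar below is a simplified, hypothetical stand-in, not UI-TARS's actual output format.

```python
import re

def parse_action(model_output: str):
    """Parse a grounded GUI action such as "click(512, 384)" or
    "type('hello')" from model text. This grammar is illustrative,
    not the real UI-TARS action vocabulary."""
    m = re.match(r"(\w+)\((.*)\)", model_output.strip())
    if not m:
        raise ValueError(f"unparseable action: {model_output!r}")
    name, args = m.group(1), m.group(2)
    coords = re.findall(r"\d+", args)
    if name == "click" and len(coords) == 2:
        # Coordinates come straight from the model's visual grounding;
        # no selector or accessibility tree is consulted.
        return ("click", (int(coords[0]), int(coords[1])))
    if name == "type":
        return ("type", args.strip("'\""))
    return (name, args)

print(parse_action("click(512, 384)"))  # → ('click', (512, 384))
```

Because the only input is a screenshot, the same parsing path serves a native app, a web page, or a remote desktop session alike.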
The architecture consists of a desktop application built with Electron that captures screenshots, sends them to the UI-TARS model for interpretation, and executes the model's suggested actions through mouse and keyboard events. The system supports both local model inference and cloud-hosted endpoints. Operators can define goals in natural language, and the agent decomposes them into step-by-step GUI interactions. The built-in action history and screenshot recording provide full observability into what the agent did and why.
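That capture-infer-act cycle, with its recorded action history, can be sketched as below. The `capture_screenshot`, `query_model`, and `execute` callables are placeholders for the Electron app's actual screen-capture, model-inference, and input-injection layers, and the `"finished"` terminal action is an assumption for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRun:
    """One goal-directed run: keeps the screenshot/action trace
    that provides the observability described above."""
    goal: str
    history: list = field(default_factory=list)

def run_agent(goal, capture_screenshot, query_model, execute, max_steps=25):
    """Hypothetical capture -> infer -> act loop. The three callables
    stand in for screen capture, UI-TARS inference (local or
    cloud-hosted), and mouse/keyboard event synthesis."""
    run = AgentRun(goal)
    for step in range(max_steps):
        screenshot = capture_screenshot()
        # The model sees raw pixels plus the goal and prior steps,
        # and replies with the next GUI action in text form.
        action = query_model(goal=goal, screenshot=screenshot,
                             history=run.history)
        run.history.append({"step": step, "screenshot": screenshot,
                            "action": action})
        if action == "finished":   # assumed terminal action
            break
        execute(action)            # inject mouse/keyboard events
    return run

# Usage with stubbed dependencies:
actions = iter(["click(40, 120)", "type('report')", "finished"])
run = run_agent(
    "open the report",
    capture_screenshot=lambda: b"<png bytes>",
    query_model=lambda **kw: next(actions),
    execute=lambda a: None,
)
print(len(run.history))  # → 3
```

The per-step dictionary is what makes the "what the agent did and why" trace possible: every executed action is stored next to the exact screenshot the model saw when it chose that action.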
With over 29,000 GitHub stars, UI-TARS Desktop represents one of the most significant open-source contributions from ByteDance to the developer tools ecosystem. The project is Apache 2.0 licensed and supports Windows, macOS, and Linux. It fills a gap in the automation landscape between browser-only tools like Playwright and API-based agents — providing a universal automation layer that works with any software that has a screen. The underlying UI-TARS model family has shown strong results on computer-use benchmarks.