
AI-powered UI automation

Midscene.js is a vision-driven automation framework that lets developers control a user interface with natural-language instructions. Instead of relying on the DOM or an accessibility tree, it uses multimodal vision models to work from what is rendered on screen. This lets the same approach drive web browsers, desktop applications (macOS, Windows, Linux), and mobile devices (Android, iOS, HarmonyOS). A unified API keeps scripts consistent across these platforms and integrates with existing Puppeteer or Playwright workflows. Midscene supports multiple vision models, including Doubao Seed, Qwen3-VL, and Gemini-3-Pro, so developers can trade off performance against cost. Because instructions describe what an element looks like rather than how it is selected, scripts can tolerate UI changes that would break selector-based automation.
Uses multimodal AI to interpret UI elements visually, bypassing the need for brittle selectors or DOM-specific code.
A unified API that works seamlessly across Web, Desktop (macOS/Windows/Linux), and Mobile (Android/iOS/HarmonyOS).
Supports a variety of vision models, allowing you to swap between Doubao, Qwen, and Gemini to optimize for cost and accuracy.
Provides atomic control through functions like aiAct and aiLocate, while supporting complex agent-based workflows.
Includes a built-in Playground and reporting tools to visualize exactly what the AI sees and how it executes actions.
Offers drop-in Agent Skills and an MCP server, enabling seamless collaboration with other AI coding tools.
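The unified API described in the features above can be sketched as one journey written once and reused across platforms. This is a minimal illustration under stated assumptions, not Midscene's actual class definitions: VisionAgent, RecordingAgent, addToCart, and demo are hypothetical names, and the method signatures are assumed. Only the method names aiAct and aiAssert come from this page.

```typescript
// Assumed minimal agent shape for illustration; a real Midscene agent
// exposes more methods and options than this.
interface VisionAgent {
  aiAct(instruction: string): Promise<void>;
  aiAssert(condition: string): Promise<void>;
}

// One journey, written once in natural language, with no selectors.
async function addToCart(agent: VisionAgent, product: string): Promise<void> {
  await agent.aiAct(`search for "${product}" and open the first result`);
  await agent.aiAct("tap the Add to Cart button");
  await agent.aiAssert("the cart badge shows 1 item");
}

// Hypothetical stand-in that records instructions instead of driving a UI.
class RecordingAgent implements VisionAgent {
  readonly calls: string[] = [];
  private readonly platform: string;

  constructor(platform: string) {
    this.platform = platform;
  }

  async aiAct(instruction: string): Promise<void> {
    this.calls.push(`${this.platform} act: ${instruction}`);
  }

  async aiAssert(condition: string): Promise<void> {
    this.calls.push(`${this.platform} assert: ${condition}`);
  }
}

// Runs the same journey against two platform-specific agents.
async function demo(): Promise<string[]> {
  const web = new RecordingAgent("web");
  const android = new RecordingAgent("android");
  await addToCart(web, "usb-c cable");
  await addToCart(android, "usb-c cable");
  return [...web.calls, ...android.calls];
}
```

In a real project the recording agent would be replaced by the agent that Midscene provides for the target platform, and addToCart would stay unchanged.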
1. Install the Midscene package via npm or yarn into your existing project.
2. Configure your preferred vision model (e.g., Gemini, Qwen, or Doubao) in the environment settings.
3. Use the intuitive API (aiAct, aiLocate, aiAssert) to define automation steps using natural language instructions.
4. Execute your automation script to allow the vision model to interpret the UI and perform actions.
5. Review the generated visualization reports in the Playground to trace and debug the automation flow.
6. Refine your workflow using YAML configurations or custom Agent strategies for complex tasks.
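The define, execute, and review steps above can be sketched as a small declarative flow runner. This is an illustration only, not Midscene's YAML format or report output: StepAgent, Step, StepResult, runFlow, and makeFakeAgent are hypothetical names, and the method signatures are assumed. The runner stops at the first failing step so the returned log ends at the step that needs debugging.

```typescript
// Assumed minimal agent shape; only the method names come from the steps above.
interface StepAgent {
  aiAct(instruction: string): Promise<void>;
  aiAssert(condition: string): Promise<void>;
}

// A flow is plain data: each step is either an action or an assertion.
type Step = { act: string } | { assert: string };

interface StepResult {
  step: Step;
  ok: boolean;
  error?: string;
}

// Runs steps in order and stops at the first failure.
async function runFlow(agent: StepAgent, steps: Step[]): Promise<StepResult[]> {
  const results: StepResult[] = [];
  for (const step of steps) {
    try {
      if ("act" in step) {
        await agent.aiAct(step.act);
      } else {
        await agent.aiAssert(step.assert);
      }
      results.push({ step, ok: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ step, ok: false, error: message });
      break;
    }
  }
  return results;
}

// Hypothetical fake agent: every action succeeds, and one chosen assertion fails.
function makeFakeAgent(failingAssertion: string): StepAgent {
  return {
    async aiAct(): Promise<void> {},
    async aiAssert(condition: string): Promise<void> {
      if (condition === failingAssertion) {
        throw new Error(`assertion failed: ${condition}`);
      }
    },
  };
}
```

Keeping the flow as data makes it easy to adjust individual instructions while refining a workflow, without touching the code that executes them.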
Automate complex user journeys across web and mobile apps without writing fragile, selector-heavy test scripts.
Write a single script that performs repetitive tasks spanning both desktop software and web browsers.
Interact with desktop or mobile applications that lack modern accessibility APIs by using visual recognition.
Professionals looking to build resilient, self-healing test suites that don't break when UI elements change.
Engineers needing to automate browser or desktop tasks using natural language instead of complex boilerplate code.
Technical leads building cross-platform automation infrastructure that requires high flexibility and AI-driven logic.
Midscene is free and open-source software released under the MIT license, which permits use, modification, and integration into your own projects as long as the copyright and license notice are retained.