Advancements in GUI Agents and Automation

The field of Graphical User Interface (GUI) agents and automation is advancing rapidly, with a focus on building more intelligent, adaptable, and deployable systems. Recent work integrates large language models (LLMs) and multimodal learning, enabling GUI agents to better understand and interact with complex interfaces. Notable directions include visual world models such as ViMo, which generate future GUI observations as images, and frameworks like LearnAct, which strengthen mobile GUI agents through human demonstrations. Researchers are also exploring new approaches to GUI automation, including inference-time process rewards and intent-based affordances, to improve agent efficiency and effectiveness. Overall, the field is moving toward more autonomous GUI agents that can interact seamlessly with humans and carry out complex tasks. Noteworthy papers include LearnAct, which proposes a unified demonstration benchmark for mobile GUI agents, and InfiGUI-R1, which introduces a reasoning-centric framework for advancing GUI agents from reactive actors to deliberative reasoners. UFO2 is also notable: it presents a multiagent AgentOS for Windows desktops that enables practical, system-level automation.
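To make the inference-time process-reward idea concrete, here is a minimal, illustrative sketch of reranking candidate GUI actions with a step-level reward signal. All names, the `Action` structure, and the toy reward function are hypothetical assumptions for illustration, not the method of any cited paper:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    """A candidate GUI action proposed by the agent's policy."""
    kind: str          # e.g. "click", "type", "scroll"
    target: str        # element identifier or screen region
    score: float = 0.0 # process reward assigned at inference time

def select_action(candidates: List[Action],
                  reward_fn: Callable[[Action], float]) -> Action:
    """Score each candidate with a process reward model, pick the best.

    The policy (e.g. a VLM) proposes several candidate actions; instead
    of executing the first sample, we rerank them with a step-level
    reward and execute the highest-scoring one.
    """
    for a in candidates:
        a.score = reward_fn(a)
    return max(candidates, key=lambda a: a.score)

# Toy stand-in for a learned reward model: prefers clicking the
# element named in the task goal.
def toy_reward(action: Action, goal: str = "submit_button") -> float:
    return 1.0 if action.kind == "click" and action.target == goal else 0.1

candidates = [
    Action("scroll", "page"),
    Action("click", "submit_button"),
    Action("type", "search_box"),
]
best = select_action(candidates, toy_reward)
print(best.kind, best.target)  # -> click submit_button
```

In practice the reward function would be a learned model scoring each step toward the task goal, and the candidates would be sampled from the VLM policy; the reranking loop itself stays this simple.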

Sources

Cellular-X: An LLM-empowered Cellular Agent for Efficient Base Station Operations

LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

A Survey on (M)LLM-Based GUI Agents

ViMo: A Generative Visual GUI World Model for App Agent

Terminal Lucidity: Envisioning the Future of the Terminal

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Toward Generation of Test Cases from Task Descriptions via History-aware Planning

UFO2: The Desktop AgentOS

Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning

Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation

PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels

ViMoTest: A Tool to Specify ViewModel-Based GUI Test Scenarios using Projectional Editing

Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning
