
Left: Comparison of our grounding model's results across five GUI grounding benchmarks. Our model, trained specifically for the agent setting, achieves SOTA results on all benchmarks under this focus. Even in the general end-to-end model setting, our model attains SOTA results on three of the benchmarks. Right: The relationship between model performance and computational cost on ScreenSpot-pro shows that our model lies on the Pareto frontier, indicating its efficiency. Most GUI research considers only the parameter count N for comparison, but our experiments highlight that computational cost at test time, such as the number of image tokens, also significantly impacts performance. The X-axis in the right figure represents ND, where D is the number of image tokens. Training and inference latency are more linearly correlated with ND than with N. A graph using latency as the X-axis closely resembles the right figure, but latency is often influenced by hardware and acceleration libraries such as vLLM, so we did not use latency as the X-axis.
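To make the cost axis concrete, the following is a minimal sketch of how the ND proxy can be computed and Pareto-optimal models identified; the model entries and numbers are hypothetical placeholders, not results from our experiments.

```python
# Sketch: compute the test-time cost proxy N*D (parameters x image tokens)
# and keep only Pareto-optimal models (no other model is both cheaper and
# at least as accurate). All entries are hypothetical placeholders.
models = [
    # (name, parameters N in billions, image tokens D, benchmark accuracy in %)
    ("model_a", 7.0, 1024, 40.0),
    ("model_b", 3.0, 2304, 38.5),
    ("model_c", 7.0, 2304, 39.0),  # dominated by model_a: costlier and less accurate
]

def pareto_front(entries):
    front = []
    for name, n, d, acc in entries:
        cost = n * d  # x-axis proxy: ND, roughly linear in latency
        dominated = any(
            n2 * d2 <= cost and acc2 >= acc and (n2 * d2, acc2) != (cost, acc)
            for _, n2, d2, acc2 in entries
        )
        if not dominated:
            front.append((name, cost, acc))
    return front

print(pareto_front(models))  # model_a and model_b remain on the frontier
```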
Abstract
With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a crucial step for CUAs to perform concrete actions, as it determines the coordinates for clicks and other interactions. Current end-to-end grounding models still achieve less than 80% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment, as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining every detail from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, have generalization potential for other perception tasks.

Agent evolution across physical and virtual worlds. Traditional systems rely on fixed controllers and pre-defined workflows to execute domain-specific tasks, either in physical environments (e.g., task-specific robots) or virtual environments (e.g., API-based Web/APP agents). In the modern era, intelligent automation has emerged. In the physical world, general-purpose robots perform versatile limb-based operations. In the virtual world, Computer Use Agents (CUAs) achieve human-level behaviors through a general-purpose planner and GUI grounding, enabling them to complete any virtual task achievable via mouse and keyboard interactions.

Three levels of tasks for CUAs. Each coral block represents an action step. Completing a task with a CUA can be divided into two steps: temporal planning and grounding. Planning involves analyzing the task description and the current state to determine the actions that should be taken in the future, while grounding refers to the execution of these specific actions. In Computer Use scenarios, grounding primarily involves generating computer-interactive commands, including keyboard and mouse instructions. Since keyboard commands, such as pressing the "A" key, are discrete, MLLMs can effectively handle this type of grounding. Our focus is primarily on mouse commands, where the main challenge lies in the fact that mouse command parameters are screen coordinates, and most MLLMs struggle to predict these coordinates accurately. Therefore, specialized training is required to determine precise click coordinates.
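As a concrete illustration of why coordinate-valued mouse commands need special handling, the sketch below parses a textual click prediction into absolute screen coordinates; the "click(x, y)" output format and the 0-1 normalized coordinates are assumptions for illustration, not the exact format used by our model.

```python
import re

# Sketch: turn a grounding model's textual output into a concrete mouse action.
# The "click(x, y)" format and normalized coordinates are illustrative assumptions.
def parse_click(output_text: str, screen_w: int, screen_h: int):
    match = re.search(r"click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", output_text)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    # Map normalized coordinates to absolute pixel positions on the screen.
    return round(x * screen_w), round(y * screen_h)

print(parse_click("click(0.42, 0.87)", 1920, 1080))  # -> (806, 940)
```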

CommonCrawl data processing pipeline.

Examples of benchmarks used in evaluation. To ensure the model's generalization capability and to avoid systematic overfitting to well-known benchmarks such as ScreenSpot, we have gathered several recent open-source and internally developed evaluation datasets. This approach aims to ensure the comprehensiveness of our testing.

Top: Illustration of the impact of modal input order on model training. Bottom: Comparison of input orders of modalities.

Illustration of the evaluation results in relation to the training computation load. The Y-axis represents the benchmark scores in click accuracy, while the X-axis denotes the training computation per sample in TFLOPs. This training computation is estimated using the formula FLOPs = 6ND, where N is the number of model parameters and D is the number of image tokens.
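For reference, a minimal sketch of this estimate; the parameter count and token count below are illustrative values, not our training configuration.

```python
# Sketch: estimate per-sample training compute with FLOPs ~= 6 * N * D,
# where N is the number of model parameters and D is the number of image tokens.
# The example values are illustrative, not measurements from the paper.
def training_tflops_per_sample(params_billion: float, image_tokens: int) -> float:
    flops = 6 * (params_billion * 1e9) * image_tokens
    return flops / 1e12  # convert to TFLOPs

# e.g. a 3B-parameter model processing 1,764 image tokens per sample:
print(training_tflops_per_sample(3.0, 1764))  # ~31.8 TFLOPs
```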

Types and Proportions of Errors on the ScreenSpot-pro Benchmark. In each image, the red rectangles represent the regions corresponding to the ground truth. Red circles indicate erroneous outputs from the previous stage, while green circles denote correct outputs from the current stage. The centers of the green circles fall within the ground truth boundaries. To avoid obstructing the image content, we have enlarged the green circles in some of the images.
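The correctness criterion illustrated here, where a click counts as correct when its center lies inside the ground-truth rectangle, reduces to a simple point-in-box test; the (x1, y1, x2, y2) box convention and the sample values below are assumptions for illustration.

```python
# Sketch: a prediction is correct when the predicted click point falls inside
# the ground-truth rectangle. Box format (x1, y1, x2, y2) is an assumption.
def click_is_correct(pred_x: float, pred_y: float, gt_box: tuple) -> bool:
    x1, y1, x2, y2 = gt_box
    return x1 <= pred_x <= x2 and y1 <= pred_y <= y2

print(click_is_correct(350, 210, (300, 180, 420, 240)))  # True: inside the box
print(click_is_correct(500, 210, (300, 180, 420, 240)))  # False: outside the box
```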
Key results

Comparison of results across the five GUI grounding test sets, all evaluated by us.

Detailed ScreenSpot-V2 results.

Detailed ScreenSpot-Pro results.

Results of UI-Vision.

Showdown-click-dev results. For the models we tested, we report inference latency using the vLLM Python library when supported; otherwise we report latency using Hugging Face Transformers, marked with 'hf'. For the settings with GPT-4o and o4-mini as planners, we add 2.5 seconds (aligned with the original benchmark) and 8 seconds (our measured average, which may depend heavily on the endpoint), respectively, to the grounding model's latency. *: Results from the original GitHub repository.
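A minimal sketch of how the reported end-to-end latency is composed under these settings; the planner latencies follow the description above, while the grounding latency in the example is a placeholder.

```python
# Sketch: end-to-end latency in the agent setting = fixed planner latency
# (2.5 s for GPT-4o, 8 s for o4-mini, as described above) + grounding latency.
# The grounding latency below is a placeholder, not a measured value.
PLANNER_LATENCY_S = {"gpt-4o": 2.5, "o4-mini": 8.0}

def end_to_end_latency(planner: str, grounding_latency_s: float) -> float:
    return PLANNER_LATENCY_S[planner] + grounding_latency_s

print(end_to_end_latency("gpt-4o", 0.7))   # 3.2 s
print(end_to_end_latency("o4-mini", 0.7))  # 8.7 s
```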


