GTA1: A Test-Time Scaled GUI Agent Outperforms OpenAI's CUA


Salesforce AI Research has introduced GTA1, a new graphical user interface (GUI) agent that advances the state of the art in agentic human-computer interaction. Designed to operate autonomously in real operating system environments such as Linux, GTA1 addresses two critical bottlenecks in GUI agent development: ambiguous task planning and inaccurate grounding of actions. With a 45.2% task success rate on the OSWorld benchmark, GTA1 surpasses OpenAI's CUA (Computer-Using Agent, 42.9%) and sets a new record among open-source models.

Core Challenges in GUI Agents

GUI agents typically translate high-level user instructions into action sequences—clicks, keystrokes, or UI interactions—while observing UI updates after each action to plan subsequent steps. However, two issues persist:

  1. Planning Ambiguity: Multiple valid action sequences can fulfill a task, leading to execution paths with varying efficiency and reliability.
  2. Grounding Precision: Translating abstract action proposals into accurate, coordinate-level GUI interactions is especially challenging in high-resolution, dynamic interfaces.

GTA1 introduces novel mechanisms to resolve both.
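
To make the setting concrete, here is a minimal sketch of the observe-plan-act loop such agents run; all names (`planner`, `grounder`, `env`) are illustrative placeholders, not GTA1's actual interfaces:

```python
# Minimal sketch of the generic observe-plan-act loop described above.
# planner, grounder, and env are illustrative placeholders, not GTA1's API.

def run_agent(instruction, env, planner, grounder, max_steps=50):
    for _ in range(max_steps):
        screenshot = env.observe()                           # capture current GUI state
        proposal = planner.propose(instruction, screenshot)  # e.g. "click the Save button"
        if proposal.is_done:                                 # planner signals task completion
            return True
        x, y = grounder.locate(proposal.target, screenshot)  # abstract target -> pixel coords
        env.execute(proposal.action_type, x, y)              # click, type, scroll, ...
    return False
```

The two bottlenecks above map directly onto the two calls in this loop: the planner's proposal (planning ambiguity) and the grounder's coordinate prediction (grounding precision).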

Smarter Planning via Test-Time Scaling

Traditional planners commit to a single action proposal at each decision point, limiting robustness. GTA1's test-time scaling introduces a simple yet effective alternative: concurrently sample multiple candidate actions at each step, then employ a judge model, typically a multimodal large language model (MLLM), to evaluate the candidates and select the most appropriate one.

This technique avoids premature commitment to a suboptimal plan and lets the agent explore alternative execution paths without forward rollouts, which are infeasible in GUI environments because many actions are irreversible. Importantly, the method works with any planner and scales well with increasing task complexity and action-space size.
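
To illustrate the idea, here is a minimal sketch of judge-based candidate selection; the `planner.propose` and `judge.score` interfaces are placeholders for exposition, not GTA1's actual API:

```python
# Sketch of test-time scaling: sample several candidate actions concurrently,
# then let a multimodal judge pick one. No rollout is performed; the judge
# scores candidates against the instruction and current screen only.
# The propose/score interfaces below are assumptions, not the paper's code.

def select_action(instruction, screenshot, planner, judge, n_candidates=8):
    # Nonzero temperature yields diverse proposals from the same planner.
    candidates = [planner.propose(instruction, screenshot, temperature=1.0)
                  for _ in range(n_candidates)]
    # The judge (an MLLM, possibly the planner model itself) rates each candidate.
    scores = [judge.score(instruction, screenshot, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```

Because no rollout is needed, each step costs only the extra candidate samples plus a single judging call.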

Reinforcement Learning for Grounding Accuracy

For GUI grounding, most prior models rely on supervised fine-tuning to predict the center of target UI elements, which limits generalization. GTA1 adopts a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO). Rather than relying on intermediate reasoning (“thinking”) or predicting bounding boxes, the model learns directly from click-based rewards: it is rewarded only when the predicted coordinate falls within the correct UI element.
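
The reward design is simple enough to sketch directly. Below is an illustrative version of the click-based reward together with the group-relative advantage normalization that gives GRPO its name; the box format and the tiny example are assumptions for exposition, not the paper's code:

```python
import numpy as np

def click_reward(pred_xy, target_box):
    """1.0 iff the predicted coordinate lands inside the ground-truth element."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box  # assumed (left, top, right, bottom) in pixels
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: each sampled prediction is scored relative to
    its group. No value network, no IoU box shaping, no chain-of-thought."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled clicks for one query; only two land inside the element.
clicks = [(150, 60), (90, 60), (200, 75), (300, 10)]
advs = group_relative_advantages([click_reward(p, (100, 40, 260, 80)) for p in clicks])
# rewards [1, 0, 1, 0] -> advantages [1, -1, 1, -1]
```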

Through this reward structure, GTA1 achieves state-of-the-art accuracy without the complexity or overhead of chain-of-thought style supervision. Notably, an ablation study shows that removing auxiliary signals such as “thinking” or IoU-based box rewards actually improves grounding performance—particularly in static environments.

Performance Across Benchmarks

GTA1 sets a new standard in several evaluations:

  • OSWorld (Task Success Rate): GTA1-7B reaches 45.2%, outperforming OpenAI CUA (42.9%) and Claude 3.7 (28.0%).
  • ScreenSpot-Pro (Grounding Accuracy): GTA1-7B scores 50.1%, ahead of models like UGround-72B (34.5%).
  • ScreenSpot-V2 (Cross-platform Grounding): GTA1-72B hits 94.8%, nearly matching the top proprietary models.
  • OSWorld-G (Linux GUI Grounding): GTA1-7B reaches 67.7%, outperforming all prior open-source approaches.

These results validate the effectiveness of both the planning and grounding innovations introduced in GTA1.

Additional Design Highlights

  • Data Cleaning: Misaligned annotations from datasets like Aria-UI and OS-Atlas are filtered using OmniParser to improve training signal fidelity (a minimal filtering sketch follows this list).
  • Model Scaling: The approach scales well across models from 7B to 72B parameters, with GTA1-7B offering the best trade-off between performance and compute.
  • Judge Reusability: The multimodal judge used in test-time scaling can be the same LLM used for planning, reducing overhead.
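
As referenced in the first item above, an alignment filter of that kind might look like the following; the `is_well_aligned` helper and the detector output format are hypothetical stand-ins for the OmniParser-based pipeline:

```python
# Hypothetical data-cleaning filter: keep a training sample only if its
# annotated click point falls inside some UI element found by a detector
# such as OmniParser. Data layout here is illustrative.

def is_well_aligned(annotation, detected_boxes):
    """annotation: {'x': int, 'y': int}; detected_boxes: [(x1, y1, x2, y2), ...]"""
    return any(x1 <= annotation["x"] <= x2 and y1 <= annotation["y"] <= y2
               for (x1, y1, x2, y2) in detected_boxes)

# Tiny usage example with made-up samples:
raw_samples = [
    {"ann": {"x": 120, "y": 55}, "boxes": [(100, 40, 260, 80)]},  # aligned -> kept
    {"ann": {"x": 500, "y": 10}, "boxes": [(100, 40, 260, 80)]},  # misaligned -> dropped
]
clean_set = [s for s in raw_samples if is_well_aligned(s["ann"], s["boxes"])]
```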

Conclusion

GTA1 demonstrates that robust and accurate GUI agents can be built using a modular two-stage framework enhanced by test-time planning diversity and precise RL-based grounding. By forgoing unnecessary complexity—such as chain-of-thought reasoning in static tasks—Salesforce AI has introduced a lean, effective agent architecture that pushes the frontier in open-ended digital interaction.


Check out the Paper, Codes, 7B Model, 32B Model and 72B Model. All credit for this research goes to the researchers of this project.
