New SOTA Web Agent beats even Operator with human intervention


1. The Rise of AI Web Agents and the Need for Robust Benchmarks

AI web agents are transforming digital interaction by automating complex online tasks. However, the web's dynamic nature makes it difficult to evaluate their performance reliably. This has created a critical demand for robust, standardized benchmarks like Halluminate's Web Bench, which verify agent capabilities and build trust in this emerging technology.

2. Understanding Halluminate's Web Bench: A New Standard for Agent Evaluation

Halluminate's Web Bench offers a rigorous, comprehensive standard for evaluating AI browser agents by distinguishing between "READ" and "WRITE" tasks. Explore its methodology and current results at halluminate.ai/blog/benchmark.

3. Introducing rtrvr.ai: The Local-First, DOM-Powered Agent

rtrvr.ai distinguishes itself in the crowded field of AI web agents through a fundamental architectural difference: its commitment to local operation. Unlike many leading agents that rely on remote cloud browsers, rtrvr.ai operates directly within the user's own browser on their device as a Chrome Extension. This design philosophy offers a suite of significant advantages that directly address common pain points in web automation.

By originating network requests from the user's local IP address, rtrvr.ai largely bypasses prevalent bot detection mechanisms and CAPTCHA challenges that frequently impede cloud-based agents. Furthermore, it avoids the use of Debugger permissions, which can expose browsers to exploits and trigger bot detection, and crucially, it can reuse local signed-in profiles and subscriptions. This means rtrvr.ai can seamlessly interact with authenticated content, such as signed-in Instagram accounts or paywalled articles, without requiring users to share sensitive credentials with a third-party provider, a capability often blocked for cloud browsers.

Key Architectural Advantages

  • Local Operation: Bypasses bot detection and CAPTCHA challenges
  • DOM-Based Approach: Direct HTML structure interaction for robust understanding
  • Multi-Tab Workflows: Parallel task execution in background tabs
  • Authenticated Access: Reuses local signed-in profiles and subscriptions
  • AI Function Calling: Custom code integration capability
  • Speed & Cost: More than 7x faster than the next fastest alternative (Browser Use Cloud), at a total cost of roughly $40 (4,000 rtrvr.ai credits) for the full benchmark

Beyond its local operational paradigm, rtrvr.ai employs a Document Object Model (DOM)-based approach to "see" and interact with web pages. Rather than relying solely on visual cues or screenshots, rtrvr.ai leverages the underlying HTML structure of a webpage, providing a deeper and more robust understanding of content and elements. This technical foundation enables several powerful capabilities: it allows for seamless operation in background tabs, facilitating powerful multi-tab workflows and efficient parallel task execution; it yields highly accurate scrapers by directly interacting with the page's structure; and it accelerates agentic workflows by utilizing full DOM information to skip redundant actions and distribute subtasks onto new tabs.
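To make the contrast concrete, here is a minimal, hypothetical sketch (in TypeScript, not rtrvr.ai's actual code) of how a DOM-based content script might enumerate actionable elements directly from a page's structure and hand the LLM a structured description instead of a screenshot; all names and selectors are illustrative.

```typescript
// Hypothetical sketch of DOM-based element discovery in a content script.
// Element tagging and the ActionableElement shape are illustrative only.

interface ActionableElement {
  selector: string; // selector the agent can later use to act on the element
  role: string;     // semantic role inferred from the tag or ARIA attributes
  text: string;     // visible label the LLM can reason over
}

function describeActionableElements(): ActionableElement[] {
  const candidates = document.querySelectorAll<HTMLElement>(
    'a[href], button, input, select, textarea, [role="button"]'
  );
  return Array.from(candidates).map((el, i) => {
    // Tag the element so a later action step can find it deterministically.
    el.dataset.agentId = String(i);
    return {
      selector: `[data-agent-id="${i}"]`,
      role: el.getAttribute('role') ?? el.tagName.toLowerCase(),
      text: (el.innerText || el.getAttribute('aria-label') || '').trim().slice(0, 80),
    };
  });
}

// The structured list can be serialized for the model; because it comes from the
// DOM, the page never needs to be visible or in the foreground tab.
console.log(JSON.stringify(describeActionableElements(), null, 2));
```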

Key Observation on Vision-Based Agents

During our evaluation, we noted that rtrvr.ai performed well even when common web elements like pop-ups and overlays appeared. rtrvr.ai was able to close these or simply perform its action despite them. This contrasts sharply with many CUA (Computer Vision-based UI Automation) or vision-based agents, which often struggle with such elements. For vision agents, a pop-up can completely obscure the underlying webpage, requiring the agent to first identify and close the pop-up before it can even "see" and interact with the intended content. This fundamental difference in how rtrvr.ai (DOM-based) perceives and interacts with the web provides a significant advantage in handling dynamic and sometimes "noisy" web environments.
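As a rough illustration of why this is easier for a DOM-based agent, consider the hypothetical snippet below: a DOM click is dispatched to the element itself, so an overlay painted on top does not block it, and common close buttons can be found through ordinary selectors. The selectors are examples only; real pop-ups vary by site.

```typescript
// Hypothetical sketch: handling a blocking overlay via the DOM instead of vision.
// The selectors below are examples; they are not a universal pop-up detector.

function dismissOverlays(): number {
  const closeButtonSelectors = [
    '[role="dialog"] button[aria-label="Close"]',
    '.modal .close',
    '#cookie-banner button',
  ];
  let dismissed = 0;
  for (const sel of closeButtonSelectors) {
    document.querySelectorAll<HTMLElement>(sel).forEach((btn) => {
      btn.click(); // DOM click events reach the element regardless of z-order
      dismissed += 1;
    });
  }
  return dismissed;
}

// Even without dismissing the overlay, the underlying node still exists in the DOM,
// so the agent can read its text or click it directly.
dismissOverlays();
document.querySelector<HTMLElement>('button[type="submit"]')?.click();
```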

A particularly compelling benefit of this DOM-based approach is its ability to mitigate the "exponential failure rate" problem inherent in multi-step web automation. In complex workflows, the probability of overall success can decrease dramatically with each additional step if individual steps have independent failure rates. By parallelizing steps across multiple tabs, rtrvr.ai fundamentally re-architects this problem, making sophisticated, multi-step tasks significantly more robust and reliable. This is a profound technical advantage for enterprise-level automation, where complex sequences of actions are common.
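As a back-of-the-envelope illustration of that failure arithmetic (the numbers here are hypothetical, not benchmark figures): if each step succeeds independently with probability p, a strictly serial workflow of n steps succeeds with probability p^n, which collapses quickly as n grows.

```latex
P(\text{serial success}) = \prod_{i=1}^{n} p_i \;\approx\; p^{\,n},
\qquad \text{e.g. } p = 0.95,\ n = 20 \;\Rightarrow\; 0.95^{20} \approx 0.36
```

Splitting the work into shorter independent chains that run in parallel tabs keeps each critical chain short, and a failed branch can be retried without redoing the others, which is where the robustness gain comes from.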

rtrvr.ai further empowers users through its "AI Function Calling" capability, allowing them to define and supply their own custom code or functions that the AI agent can autonomously invoke. This feature provides immense flexibility and extensibility, enabling users to tailor the agent's capabilities to virtually any external tool, API, or custom workflow, effectively shifting control and customization power to the user. From a practical standpoint, rtrvr.ai operates as a client-side Chrome extension, vetted and tested by Google, ensuring a secure and sandboxed execution environment.
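The general shape of such a feature looks roughly like the following sketch; the registerTool/dispatch API and the example tool are hypothetical stand-ins, not rtrvr.ai's documented interface.

```typescript
// Hypothetical sketch of user-supplied AI function calling.
// registerTool(), dispatch(), and the example tool are illustrative only.

type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

const toolRegistry = new Map<string, { description: string; handler: ToolHandler }>();

function registerTool(name: string, description: string, handler: ToolHandler): void {
  // The description would be surfaced to the LLM so it knows when the tool applies.
  toolRegistry.set(name, { description, handler });
}

// Example: a user-defined function hitting the user's own backend (URL is made up).
registerTool(
  'lookup_order',
  'Fetch the status of an order by id from an internal API',
  async (args) => {
    const res = await fetch(`https://internal.example.com/api/orders/${args.orderId}`);
    return res.ok ? await res.text() : `lookup failed: ${res.status}`;
  }
);

// When the model emits a function call, the agent dispatches it to the user's code.
async function dispatch(call: { name: string; args: Record<string, unknown> }): Promise<string> {
  const tool = toolRegistry.get(call.name);
  return tool ? tool.handler(call.args) : `unknown tool: ${call.name}`;
}

dispatch({ name: 'lookup_order', args: { orderId: '12345' } }).then(console.log);
```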

4. rtrvr.ai's Performance on the Halluminate Web Bench

rtrvr.ai's performance on the Halluminate Web Bench sets a new benchmark for AI web agent capabilities, demonstrating a remarkable 81.79% overall success rate that significantly surpasses other leading models. Access the complete data here.

Overall Success: A Comprehensive Look

On the Halluminate Web Bench, rtrvr.ai achieved an impressive 81.79% overall success rate, placing it at the forefront of evaluated agents and significantly outperforming models from prominent organizations. For context, the previously top-ranked Anthropic Sonnet 3.7 CUA achieved 66.0%, followed by Skyvern 2.0 at 64.4%, and OpenAI CUA at 59.8%. rtrvr.ai even beats OpenAI Operator with Human Supervision, which scored 76.5%. This performance sets a new state of the art for web agents on this specific benchmark.

Rank | Model | Organization | Score (%) | Category | Date
🥇 #1 | rtrvr.ai | rtrvr.ai | 81.79 | OVERALL | 2025-06-20
🥈 #2 | Anthropic Sonnet 3.7 CUA | Anthropic | 66.0 | OVERALL | 2025-01-15
🥉 #3 | Skyvern 2.0 | Open Source | 64.4 | OVERALL | 2025-01-15
#4 | Skyvern 2.0 on Browserbase | Open Source | 60.7 | OVERALL | 2025-01-15
#5 | OpenAI CUA | OpenAI | 59.8 | OVERALL | 2025-01-15
#6 | Browser Use Cloud | Open Source | 43.9 | OVERALL | 2025-01-10
#7 | Convergence AI | Convergence | 39.9 | OVERALL | 2025-01-10

Halluminate Web Bench - Overall Performance Leaderboard

[Chart: rtrvr.ai leads with an 81.79% success rate, ahead of OpenAI Operator with Human Supervisor (76.5%), Anthropic Sonnet 3.7 CUA (66%), and Skyvern 2.0 on Browserbase (60.7%).]

Unparalleled Speed: Agent Runtime Duration

Beyond accuracy, the speed at which an AI agent completes tasks is a crucial performance metric, especially for latency-sensitive applications. Our evaluation showed that rtrvr.ai achieved an average execution time of 0.9 minutes (less than 1 minute) per task, making it the fastest of all agents observed in this context.

This dramatically outperforms other agents benchmarked by Halluminate, which show average execution times ranging from several minutes to over twenty minutes per task:

Agent Execution Speed Comparison (Minutes per Task)

[Chart: rtrvr.ai completes tasks in under 1 minute, roughly 7x faster than the next fastest agent; Browser Use Cloud averages 6.35 minutes, Anthropic Sonnet 3.7 CUA 11.81 minutes, Skyvern 2.0 12.49 minutes, and Skyvern 2.0 on Browserbase 20.84 minutes.]

Speed Comparison Breakdown:

  • rtrvr.ai: 0.9 minutes
  • Browser Use Cloud: 6.35 minutes
  • OpenAI CUA: 10.1 minutes
  • Anthropic Sonnet 3.7 CUA: 11.81 minutes
  • Skyvern 2.0: 12.49 minutes
  • Skyvern 2.0 on Browserbase: 20.84 minutes

This exceptional speed is a direct benefit of rtrvr.ai's local, DOM-based architecture. By operating as a Chrome Extension, it largely avoids network latencies associated with remote cloud browsers and leverages direct, efficient interaction with the DOM. This allows it to navigate and process information much faster than vision-based or remote alternatives, reinforcing its previously reported capability of being up to 4x faster than alternatives like OpenAI Operator.

Deep Dive: Excelling in Read Tasks

rtrvr.ai demonstrated exceptional proficiency in read-heavy tasks, achieving an 89.45% success rate. This performance underscores rtrvr.ai's strength in accurate data extraction and information retrieval, which are core capabilities for numerous automation needs. This high success rate directly correlates with rtrvr.ai's DOM-based approach, which enables it to "see" and interpret the underlying structure of web pages with high fidelity.

Rank | Model | Organization | Score (%) | Category | Date
🥇 #1 | rtrvr.ai | rtrvr.ai | 89.45 | READ | 2025-06-20
🥈 #2 | Anthropic Sonnet 3.7 CUA | Anthropic | 80.6 | READ | 2025-01-15
🥉 #3 | Skyvern 2.0 on Browserbase | Open Source | 75.6 | READ | 2025-01-15
#4 | OpenAI CUA | OpenAI | 75.0 | READ | 2025-01-15
#5 | Skyvern 2.0 | Open Source | 74.2 | READ | 2025-01-15
#6 | Browser Use Cloud | Open Source | 63.2 | READ | 2025-01-10
#7 | Convergence AI | Convergence | 51.8 | READ | 2025-01-10

Read Task Performance Comparison

[Chart: rtrvr.ai leads read tasks with an 89.45% success rate, ahead of Anthropic Sonnet 3.7 CUA (80.6%) and Skyvern 2.0 on Browserbase (75.6%).]

Deep Dive: Navigating Write Tasks

While write-heavy tasks are generally more challenging for all AI agents due to their interactive nature and need for precise state management, rtrvr.ai demonstrated strong performance with a 62.89% success rate. This score is notably higher than other agents on the benchmark. For instance, Skyvern 2.0, the next highest, achieved 46.6%, and Anthropic Sonnet 3.7 CUA scored 39.4%. This lead is significant, especially considering the general industry trend of lower performance on write tasks.

Rank | Model | Organization | Score (%) | Category | Date
🥇 #1 | rtrvr.ai | rtrvr.ai | 62.89 | WRITE | 2025-06-20
🥈 #2 | Skyvern 2.0 | Open Source | 46.6 | WRITE | 2025-01-15
🥉 #3 | Anthropic Sonnet 3.7 CUA | Anthropic | 39.4 | WRITE | 2025-01-15
#4 | Skyvern 2.0 on Browserbase | Open Source | 33.6 | WRITE | 2025-01-15
#5 | OpenAI CUA | OpenAI | 32.3 | WRITE | 2025-01-15
#6 | Convergence AI | Convergence | 13.1 | WRITE | 2025-01-10
#7 | Browser Use Cloud | Open Source | 11.4 | WRITE | 2025-01-10

Write Task Performance Comparison

[Chart: rtrvr.ai leads write tasks with a 62.89% success rate, ahead of Anthropic Sonnet 3.7 CUA (39.4%) and Skyvern 2.0 on Browserbase (33.6%).]

Our Evaluation Methodology and Key Learnings

Our evaluation of rtrvr.ai on the Halluminate Web Bench involved specific setup procedures and yielded valuable insights into the agent's capabilities and areas for future improvement:

Evaluation Setup:

  • Security First: Credit cards were locked before evaluation to prevent unintended transactions during automated interactions.
  • Pre-registered Accounts: Tasks assumed the agent was already logged into necessary accounts, with pre-registration removing the initial login hurdle from benchmark scope.
  • Streamlined Task Management: rtrvr.ai's inherent capability to ingest tasks and URLs directly from spreadsheet formats made benchmark setup remarkably easy, requiring minimal adjustments to capture success rates and time metrics.
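For context, the task list we fed the agent looked roughly like the sketch below; the column names and parsing are specific to our setup and are shown only to illustrate how little glue code the spreadsheet-driven workflow required.

```typescript
// Hypothetical sketch of a spreadsheet-driven benchmark run.
// Column names and the task shape are illustrative of our setup, not a public API.

interface BenchTask {
  url: string;
  instruction: string;
  category: 'READ' | 'WRITE';
}

function parseTaskCsv(csv: string): BenchTask[] {
  const [header, ...rows] = csv.trim().split('\n');
  const cols = header.split(',');
  return rows.map((row) => {
    const cells = row.split(',');
    const record = Object.fromEntries(cols.map((c, i) => [c.trim(), cells[i]?.trim() ?? '']));
    return {
      url: record['url'],
      instruction: record['instruction'],
      category: record['category'] as BenchTask['category'],
    };
  });
}

const csv = `url,instruction,category
https://example.com/products,Extract the price of the first listed item,READ
https://example.com/contact,Fill out and submit the contact form,WRITE`;

for (const task of parseTaskCsv(csv)) {
  console.log(`Would run ${task.category} task on ${task.url}: ${task.instruction}`);
}
```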

Key Learnings & Observations:

  • Iterative Improvement ("Hill Climb"): We will "hill climb" on the identified failure cases and expect dramatically better performance on future runs!
  • Agent's Tool Use (Googling): Despite task goals being confined to specific website navigation, the agent occasionally resorted to Googling, which we counted as valid due to rtrvr.ai's robust URL navigation capabilities.
  • Networking and Posting Limits: Certain websites exhibited aggressive limits, occasionally flagging IP addresses, pointing to requirements for distributed testing setups or rotating IPs.

Agent Interaction Quirks:

  • Aggressive Scrolling: The agent sometimes exhibited aggressive scrolling behavior, which could occasionally lead to elements being missed or page load issues.
  • No Hover Action: A current limitation was the lack of a "hover" action, preventing interaction with dynamic UI elements that rely on hover states to reveal sub-menus or information.
  • Dropdown Bugs: Dropdown menus that require typing into a field, clicking to reveal options, and then selecting from a list exposed specific bugs and proved challenging as multi-step interactions.
  • Feedback Mechanism Differences: While rtrvr.ai can ask for human feedback in other workflows, the spreadsheet-driven benchmark workflow didn't incorporate such feedback loops.
  • Crawl Functionality Limits: Multi-tab processing was purposefully limited during benchmarking to maintain consistency, though real-world scenarios could leverage this for even better results.

Cost-Effectiveness Analysis:

For this evaluation, we used Google Gemini's "Flash" model for speed and price optimization. While the "Pro" model might offer even better results, the Flash model delivered exceptional performance at remarkable cost efficiency:

  • Total Tasks: 323 tasks
  • Total Cost: 4,000 credits (~$40)
  • Cost per Task: ~$0.12

In comparison, Halluminate reported testing costs of ~$3,000 per agent with human annotators.

Notes & Comments on the Web Bench Design

While Halluminate's Web Bench is a significant step forward, our evaluation highlighted several considerations for future benchmark iterations:

  • Language Limitations: The current benchmark lacks tasks involving foreign language sites, limiting applicability for global use cases and agent multilingual capabilities.
  • Real-World Relevance: There is a disconnect between the "top human-visited websites" and the websites where users would most want to use AI agents. Benchmarks should evolve to measure utility for actual user needs.
  • Task Design and Agent Tooling: We observed that rtrvr.ai occasionally used web search (Googled) even when tasks seemed to imply direct website navigation. While rtrvr.ai successfully completed these tasks by leveraging its internal tools for URL navigation, this highlights an opportunity for benchmark design. Instead of simply restricting tasks, future benchmarks could be designed to be more complex and open-ended, explicitly encouraging agents to utilize their full suite of tools, including web search and advanced URL parsing, to find information or navigate to destinations. This would better reflect real-world user scenarios where agents are expected to be resourceful and demonstrate comprehensive problem-solving.
  • Infrastructure Management: Running the benchmark on our personal machines resulted in IP addresses being flagged for "spammy content" due to high request volumes, highlighting the need for professional infrastructure such as Halluminate's, which uses rotating IPs and human annotators.
  • Testing Interleaving: Websites should be interleaved during testing to reduce site-specific blocks and rate limits.

📹 rtrvr.ai Web Bench Performance Demonstration

Watch the rtrvr.ai eval playlist to validate the benchmark performance yourself!

5. Comparing rtrvr.ai to the Field: A Leader in Its Own Right

rtrvr.ai's performance on the Halluminate Web Bench is not merely a quantitative win; it represents a qualitative validation of its unique architectural approach. The consistently superior scores across all categories, especially when contrasted with agents that rely on cloud-based infrastructure, highlight a fundamental advantage derived from its local operation and DOM-based design.

Most AI web agents operating in cloud environments frequently encounter "Infrastructure issues" such as bot detection, CAPTCHA challenges, and difficulties with authenticated logins. These external barriers can significantly impede an agent's ability to complete tasks, regardless of its underlying AI intelligence. rtrvr.ai's local operation directly addresses these common pitfalls by functioning within the user's browser and leveraging their local IP address and existing signed-in sessions, inherently circumventing many of these obstacles.

Furthermore, rtrvr.ai's lead in the challenging "write task" category is particularly noteworthy. While the industry generally observes lower performance for write-heavy tasks due to their inherent complexity and the need for precise interaction and state management, rtrvr.ai's significantly higher success rate demonstrates its advanced capabilities in these interactive scenarios. This strong performance in a notoriously difficult area underscores the robustness of its DOM-based approach, which aids in navigating complex web elements and managing multi-step processes more effectively.

A crucial qualitative aspect of rtrvr.ai's performance lies in the nature of its failures. The overwhelming majority of rtrvr.ai's errors are "agent errors" rather than "infrastructure errors." This distinction is profoundly important for development and product teams. An "agent error" signifies that the AI itself failed to understand or execute a task, a problem that can be directly addressed through improvements to the model, refinement of prompt structures, or enhancement of its internal logic. Conversely, an "infrastructure error," such as being blocked by a CAPTCHA or IP detection, is often external to the agent's core intelligence and much harder for the agent developer to control.

6. Unpacking Failure Modes: Insights from rtrvr.ai's Performance

A critical component of the Halluminate Web Bench evaluation is the detailed breakdown of failure modes, which provides granular insight into where and why agents falter. For rtrvr.ai, this analysis reveals a highly favorable distribution of errors, underscoring the effectiveness of its unique architectural design.

The Breakdown: Agent vs. Infrastructure Errors

rtrvr.ai's failure analysis shows a striking imbalance between the two primary categories of errors:

  • Agent Errors (94.74%): internal AI logic and execution issues. These failures can be directly addressed through AI improvements.
  • Infrastructure Errors (5.26%): external blocking and access issues, remarkably low thanks to the local operation architecture.

This extremely low percentage of infrastructure errors is a direct and powerful testament to rtrvr.ai's local, browser-extension design. Unlike cloud-based agents that frequently encounter obstacles such as bot detection, CAPTCHAs, and login authentication issues due to their remote nature, rtrvr.ai's operation within the user's own browser effectively bypasses these common external barriers.

Focusing on Agent-Specific Improvements

The overwhelming majority of rtrvr.ai's failures fall into the "agent error" category. These failures typically relate to the AI's internal logic: choosing the wrong web trajectory for some niche tasks, missing tools for interacting with tail-end websites, incomplete task execution, errors attributable to using the less capable Gemini Flash model rather than Pro (which is better suited to complex tasks), and timeouts caused by inefficient processing or model unavailability (rtrvr.ai does not yet use any reserved quota). While any failure indicates an area for improvement, having nearly all failures attributable to the agent itself is, paradoxically, a desirable characteristic for an AI agent.

This distribution means that rtrvr.ai's development team can concentrate almost entirely on enhancing the core AI's intelligence, reasoning, and robustness. Efforts can be precisely targeted at refining prompt structures, improving model configurations, and bolstering the agent's ability to understand and interact with diverse web elements. This contrasts sharply with agents that face a high proportion of infrastructure errors, where development resources must often be diverted to managing external factors like proxy rotations or CAPTCHA solving services.

7. The rtrvr.ai Advantage: Why Local and DOM-Based Matters

rtrvr.ai's exceptional performance on the Halluminate Web Bench is a direct consequence of its foundational design principles: local operation and a DOM-based approach. These are not merely technical specifications but strategic choices that translate into profound real-world advantages for users and businesses.

The local operation of rtrvr.ai, running as a Chrome Extension directly within the user's browser, is a game-changer for web automation. This architecture inherently bypasses the pervasive challenges of bot detection and CAPTCHAs, which are common hurdles for cloud-based agents. By originating network requests from the user's local IP address and reusing existing signed-in profiles and subscriptions, rtrvr.ai gains unparalleled access to authenticated and paywalled content that other agents struggle with.

Complementing this, rtrvr.ai's DOM-based approach provides a deeper, more robust understanding of web pages compared to agents that rely on visual parsing. By interacting directly with the Document Object Model, rtrvr.ai can operate seamlessly in background tabs, enabling efficient multi-tab workflows and parallel task execution. This technical capability is crucial for achieving accurate data scraping and faster agentic workflows.
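Mechanically, a Chrome extension can fan subtasks out to background tabs with the standard chrome.tabs API, as in the minimal sketch below; the Subtask shape and lifecycle handling are illustrative and not rtrvr.ai's internal code.

```typescript
// Hypothetical sketch: fanning subtasks out to background tabs from a Chrome
// extension service worker. Uses the standard chrome.tabs API; the subtask
// shape and processing step are illustrative only.

interface Subtask {
  url: string;
  instruction: string;
}

async function runInBackgroundTabs(subtasks: Subtask[]): Promise<void> {
  await Promise.all(
    subtasks.map(async (task) => {
      // active: false opens the tab in the background, so the user's current
      // tab stays in focus while the agent works.
      const tab = await chrome.tabs.create({ url: task.url, active: false });
      try {
        // A content script injected into the tab would read the DOM and act;
        // here we only demonstrate the tab lifecycle.
        console.log(`Processing "${task.instruction}" in tab ${tab.id}`);
      } finally {
        if (tab.id !== undefined) await chrome.tabs.remove(tab.id);
      }
    })
  );
}
```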

  • Bypasses bot detection and CAPTCHAs
  • Reuses signed-in profiles and subscriptions
  • Enhanced security and privacy
  • Up to 7x faster performance
  • Cost-effective at under $0.12 per task on average (including re-runs needed after crashes or the personal test device overheating)
  • 0.9 minute average task completion
  • Deeper web page understanding
  • Seamless background tab operation
  • Efficient multi-tab workflows
  • Highly accurate data scraping
  • Collapses exponential failure rates
  • Handles pop-ups and overlays effectively

8. Conclusion & The Future of Web Automation with rtrvr.ai

The Halluminate Web Bench has provided a critical, standardized framework for evaluating the burgeoning field of AI web agents, and rtrvr.ai's performance on this benchmark sets a new standard for the industry. With an impressive 81.79% overall success rate, rtrvr.ai has not only surpassed leading models from established organizations but has also demonstrated exceptional proficiency in both read-heavy tasks (89.45% success) and the notoriously challenging write-heavy tasks (62.89% success). Crucially, its average task execution time of 0.9 minutes positions it as the fastest among its peers, offering a significant advantage for latency-sensitive applications.

The core of rtrvr.ai's advantage lies in its unique architectural design: a local, browser-extension operation combined with a DOM-based approach. This design fundamentally addresses the pervasive "infrastructure issues" that plague many cloud-based agents, as evidenced by rtrvr.ai's remarkably low 5.26% infrastructure error rate. This means rtrvr.ai is less susceptible to bot detection, CAPTCHAs, and login challenges, ensuring consistent access and reliable task execution.

The overwhelming majority of its failures are attributable to "agent errors," which, while requiring attention, represent a "good problem" for development teams. This allows rtrvr.ai to focus its efforts almost entirely on enhancing the core AI's intelligence, reasoning, and execution capabilities, rather than expending resources on external environmental workarounds. This systematic evaluation provides a clear roadmap for future development, ensuring that product enhancements are data-driven and yield the most significant improvements in user experience.

rtrvr.ai's performance on the Halluminate Web Bench is not just a testament to its current capabilities but also a strong indicator of its potential to lead the next generation of AI web agents. Its ability to operate reliably within the user's environment, handle complex multi-step workflows, and integrate with personalized, authenticated content positions it as a powerful, secure, and efficient solution for a wide array of real-world web automation needs. As the demand for robust and dependable AI automation continues to grow, rtrvr.ai's proven architectural advantages and benchmark-validated performance make it a compelling choice for individuals and enterprises seeking to unlock the full potential of autonomous web interaction.

Make your browser self-driving

Ready to transform your online workflows with the industry's leading AI web agent?

Try rtrvr.ai Today →

Works Cited (accessed June 20, 2025)

  1. Evaluating AI agent applications - Wandb, https://wandb.ai/site/wp-content/uploads/2025/02/Evaluations-whitepaper.pdf
  2. What is Mosaic AI Agent Evaluation (legacy)? - Databricks Documentation, https://docs.databricks.com/aws/en/generative-ai/agent-evaluation/
  3. Web Bench: The Current State of Browser Agents - Halluminate, https://halluminate.ai/blog/benchmark
  4. Web Bench - A new way to compare AI Browser Agents - Skyvern Blog, https://blog.skyvern.com/web-bench-a-new-way-to-compare-ai-browser-agents/
  5. HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection, https://arxiv.org/html/2505.00506v1
  6. An AI Web Agent Deep Comparison: rtrvr.ai vs. OpenAI Operator vs ..., https://www.rtrvr.ai/blog/ai-web-agents-deep-comparison