WebBench: Browser Agent Benchmarks


WebBench is an open, task-oriented benchmark designed to measure how effectively browser agents handle complex, realistic web workflows. It includes 2,454 tasks across 452 live websites selected from the global top 1,000 by traffic.

Dataset Card: HuggingFace

Last updated: May 28, 2025

Web browsing agents have rapidly evolved, with numerous new entrants claiming state-of-the-art performance. However, we find that even advanced browser agents frequently struggle with real-world tasks, particularly due to the inherently adversarial nature of the web, challenges with authentication, form-filling inefficiencies, and difficulties with file downloads.

We developed WebBench to systematically quantify and address these performance gaps, expanding on the foundational work introduced by WebVoyager.

Significant Expansion: Increased website coverage from 15 → 452 and tasks from 642 → 5,750 (the category table below breaks down the 2,454 tasks in the open dataset).
Task Type Differentiation: Clearly defined READ vs WRITE tasks.
- READ: Navigation and data retrieval.
- WRITE: Data input, authentication, file operations, and solving 2FA challenges, all areas notably underrepresented in previous benchmarks.
Infrastructure Impact Measurement: Explicit consideration of browser infrastructure complexities (e.g., CAPTCHA solving, direct website interactions).

Category	Description	Example	Count (% of dataset)
READ	Tasks involving navigation and data extraction	“Navigate to the news section and summarize the headline and key points from the latest science policy update.”	1580 (64.4%)
CREATE	Tasks involving data creation on websites	“Log in to your account and create a new board titled "Summer 2024" in the "Wish List" section, then add a "sleeveless midi dress" from the Women’s category to it.”	512 (20.9%)
UPDATE	Tasks requiring data updates	“Adjust your journal notification preferences in your Springer account to receive email updates about "Online First" articles instead of the default monthly summary.”	173 (7.0%)
DELETE	Tasks requiring data deletion	“Log in to your account, create a temporary test question in the Academia community titled "Test Question for Deletion," then delete it and confirm its removal from your profile.”	149 (6.1%)
FILE_MANIPULATION	Tasks involving file downloads	“Locate a downloadable recipe printout for a popular dessert recipe, download the file, and verify that the filename includes the recipe’s name.”	40 (1.6%)
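
As a rough illustration of how the five categories in the table roll up into the READ vs WRITE split, here is a minimal Python sketch. The counts are copied from the table above. Treating CREATE, UPDATE, DELETE, and FILE_MANIPULATION as the WRITE group is an assumption made for this example, and the names `CATEGORY_COUNTS`, `WRITE_CATEGORIES`, and `read_write_split` are hypothetical.

```python
# Task counts per category, copied from the table above.
CATEGORY_COUNTS = {
    "READ": 1580,
    "CREATE": 512,
    "UPDATE": 173,
    "DELETE": 149,
    "FILE_MANIPULATION": 40,
}

# Assumption for this sketch: every non-READ category counts as WRITE.
WRITE_CATEGORIES = {"CREATE", "UPDATE", "DELETE", "FILE_MANIPULATION"}


def read_write_split(counts: dict[str, int]) -> dict[str, int]:
    """Collapse per-category counts into READ and WRITE totals."""
    split = {"READ": 0, "WRITE": 0}
    for category, n in counts.items():
        group = "WRITE" if category in WRITE_CATEGORIES else "READ"
        split[group] += n
    return split


split = read_write_split(CATEGORY_COUNTS)
total = sum(split.values())
for group, n in split.items():
    # Share of the dataset that falls into each group.
    print(f"{group}: {n} tasks ({n / total:.1%})")
```

Under that grouping, roughly a third of the tasks require the agent to change state on a site rather than only read from it.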

Benchmarking: Systematically compare different agent architectures.
Ablation & Debugging: Identify specific failure points (DOM changes, pop-ups, authentication hurdles).
Rapid Prototyping: Quickly validate improvements under realistic web scenarios.
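
One way to support the comparison and debugging uses listed above is to aggregate pass/fail outcomes per agent and task category. The sketch below is hypothetical: the `TaskResult` record, its field names, the `success_rates` helper, and the sample rows are invented for illustration and do not reflect WebBench's actual schema or any measured results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record for one agent run on one task."""
    agent: str
    category: str  # e.g. "READ", "CREATE", "UPDATE", "DELETE", "FILE_MANIPULATION"
    success: bool


def success_rates(results: list[TaskResult]) -> dict[tuple[str, str], float]:
    """Return the fraction of successful runs per (agent, category) pair."""
    passed: dict[tuple[str, str], int] = defaultdict(int)
    attempted: dict[tuple[str, str], int] = defaultdict(int)
    for r in results:
        key = (r.agent, r.category)
        attempted[key] += 1
        passed[key] += int(r.success)
    return {key: passed[key] / attempted[key] for key in attempted}


# Made-up sample data, only to show the shape of the output.
sample = [
    TaskResult("agent_a", "READ", True),
    TaskResult("agent_a", "READ", False),
    TaskResult("agent_a", "CREATE", False),
    TaskResult("agent_b", "READ", True),
    TaskResult("agent_b", "CREATE", True),
]

for (agent, category), rate in sorted(success_rates(sample).items()):
    print(f"{agent:8s} {category:8s} {rate:.0%}")
```

Breaking results down by category rather than reporting a single overall score makes it easier to see whether an agent's failures cluster in write-heavy tasks such as CREATE or DELETE.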

Benchmarking upcoming browser agents, including Claude 4, Operator O3, UI-TARS, and Mariner API.
Extending coverage beyond the top 1000 global websites.
Incorporating multilingual tasks to evaluate agent performance in various languages.