WebBench is an open, task-oriented benchmark designed to measure how effectively browser agents handle complex, realistic web workflows. It includes 2,454 tasks across 452 live websites selected from the global top-1000 by traffic. Dataset Card: HuggingFace
Last updated: May 28, 2025
Web browsing agents have rapidly evolved, with numerous new entrants claiming state-of-the-art performance. However, we find that even advanced browser agents frequently struggle with real-world tasks, particularly due to the inherently adversarial nature of the web, challenges with authentication, form-filling inefficiencies, and difficulties with file downloads.
We developed WebBench to systematically quantify and address these performance gaps, building on the foundational work introduced by WebVoyager.
- Significant Expansion: Increased website coverage from 15 → 452 and tasks from 642 → 5,750.
- Task Type Differentiation: Clearly defined READ vs WRITE tasks.
  - READ: Navigation and data retrieval.
  - WRITE: Data input, authentication, file operations, and solving 2FA challenges, all areas notably underrepresented in previous benchmarks.
- Infrastructure Impact Measurement: Explicit consideration of browser infrastructure complexities (e.g., CAPTCHA solving, direct website interactions).
| Category | Description | Example | Tasks (share) |
| --- | --- | --- | --- |
| READ | Tasks involving navigation and data extraction | “Navigate to the news section and summarize the headline and key points from the latest science policy update.” | 1580 (64.4%) |
| CREATE | Tasks involving data creation on websites | “Log in to your account and create a new board titled "Summer 2024" in the "Wish List" section, then add a "sleeveless midi dress" from the Women’s category to it.” | 512 (20.9%) |
| UPDATE | Tasks requiring data updates | “Adjust your journal notification preferences in your Springer account to receive email updates about "Online First" articles instead of the default monthly summary.” | 173 (7.1%) |
| DELETE | Tasks requiring data deletion | “Log in to your account, create a temporary test question in the Academia community titled "Test Question for Deletion," then delete it and confirm its removal from your profile.” | 149 (6.1%) |
| FILE_MANIPULATION | Tasks involving file downloads | “Locate a downloadable recipe printout for a popular dessert recipe, download the file, and verify that the filename includes the recipe’s name.” | 40 (1.5%) |
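The category shares can be recomputed from the raw task counts. The sketch below does that in Python and also rolls the four non-READ categories into a single WRITE bucket. That grouping is an assumption based on the READ vs WRITE distinction described earlier, not a definition taken from the dataset, and the function names are illustrative only. Recomputed shares may differ from the published figures by about 0.1 percentage point because of rounding.

```python
# Task counts per category, copied from the task-category table.
CATEGORY_COUNTS = {
    "READ": 1580,
    "CREATE": 512,
    "UPDATE": 173,
    "DELETE": 149,
    "FILE_MANIPULATION": 40,
}

# Assumption: every category other than READ is treated as a WRITE task.
WRITE_CATEGORIES = {"CREATE", "UPDATE", "DELETE", "FILE_MANIPULATION"}


def category_shares(counts):
    """Return each category's share of the total as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def read_write_split(counts):
    """Collapse the five categories into READ and WRITE totals."""
    write = sum(n for name, n in counts.items() if name in WRITE_CATEGORIES)
    return {"READ": counts["READ"], "WRITE": write}


if __name__ == "__main__":
    for name, share in category_shares(CATEGORY_COUNTS).items():
        print(f"{name:<18} {CATEGORY_COUNTS[name]:>5}  {share:5.1f}%")
    print(read_write_split(CATEGORY_COUNTS))
```

Under this grouping, roughly one task in three exercises a WRITE path, which is where authentication and form-filling failures tend to surface.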
WebBench supports several use cases:
- Benchmarking: Systematically compare different agent architectures.
- Ablation & Debugging: Identify specific failure points (DOM changes, pop-ups, authentication hurdles).
- Rapid Prototyping: Quickly validate improvements under realistic web scenarios.
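For the benchmarking and debugging use cases, one simple way to compare agents is to aggregate per-task outcomes by category and by failure reason. The following is a minimal sketch under stated assumptions: the `TaskResult` record layout, the `failure_reason` labels, and both helper functions are hypothetical and are not part of WebBench or its evaluation scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent attempt at one task (hypothetical record layout)."""
    task_id: int
    category: str             # e.g. "READ", "CREATE", "UPDATE", "DELETE"
    success: bool
    failure_reason: str = ""  # free-form label such as "login" or "popup"


def success_rate_by_category(results):
    """Return {category: fraction of attempted tasks that succeeded}."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for r in results:
        attempts[r.category] += 1
        successes[r.category] += int(r.success)
    return {cat: successes[cat] / attempts[cat] for cat in attempts}


def failure_breakdown(results):
    """Count failed tasks per failure reason, most common first."""
    counts = defaultdict(int)
    for r in results:
        if not r.success:
            counts[r.failure_reason or "unlabeled"] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up results, for illustration only.
    demo = [
        TaskResult(1, "READ", True),
        TaskResult(2, "READ", False, "popup"),
        TaskResult(3, "CREATE", False, "login"),
        TaskResult(4, "CREATE", False, "login"),
    ]
    print(success_rate_by_category(demo))
    print(failure_breakdown(demo))
```

Keeping the failure reason as a free-form label makes it easy to add new failure modes during debugging without changing the aggregation code.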
Planned next steps:
- Benchmarking upcoming browser agents, including Claude 4, Operator O3, UI-TARS, and Mariner API.
- Extending coverage beyond the top 1000 global websites.
- Incorporating multilingual tasks to evaluate agent performance in various languages.
Resources:
- Official Leaderboard: WebBench Leaderboard
- Dataset Card: HuggingFace
- Technical Report: Halluminate Technical Report
- Launch Announcement: In partnership with Skyvern
We welcome contributions, including new tasks, evaluation scripts, and bug reports.
To benchmark your browser agent, please contact: [email protected]