SWE-rebench: Over 21,000 Open Tasks for SWE LLMs


SWE-rebench is a large-scale dataset designed to support training and evaluation of LLM-based software engineering (SWE) agents, building upon and expanding our earlier release, SWE-bench-extra. It is constructed using a fully automated pipeline that continuously extracts real-world interactive SWE tasks from GitHub repositories at scale, as detailed in our paper SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents.

The dataset currently comprises over 21,000 issue–pull request pairs from 3,400+ Python repositories, each validated for correctness through automated environment setup and test execution. A curated subset of these tasks also forms the basis of our continuously updated SWE-rebench leaderboard.

SWE-rebench builds upon and extends the methodology of SWE-bench by incorporating several key enhancements detailed in our paper, including:

A fully automated pipeline for continuous task collection.
LLM-driven extraction and validation of environment installation instructions.
An automated LLM-based task quality assessment pipeline that annotates tasks with labels such as clarity, complexity, or test patch validity.

from datasets import load_dataset
ds = load_dataset('nebius/SWE-rebench')

The SWE-rebench dataset schema extends the original SWE-bench schema with additional fields to support richer analysis. The complete schema is detailed in the table below. For more information about the data and the methodology behind collecting it, please refer to our paper.

Field name	Type	Description
instance_id	str	A formatted instance identifier, usually as repo_owner__repo_name-PR-number.
patch	str	The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
repo	str	The repository owner/name identifier from GitHub.
base_commit	str	The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
hints_text	str	Comments made on the issue before the creation date of the solution PR’s first commit.
created_at	str	The creation date of the pull request.
test_patch	str	A test-file patch that was contributed by the solution PR.
problem_statement	str	The issue title and body.
version	str	Installation version to use for running evaluation.
environment_setup_commit	str	Commit hash to use for environment setup and installation.
FAIL_TO_PASS	str	A JSON list of strings that represent the set of tests resolved by the PR and tied to the issue resolution.
PASS_TO_PASS	str	A JSON list of strings that represent tests that should pass before and after the PR application.
meta	str	A JSON dictionary indicating whether the instance is lite, along with a list of failed lite validators if it is not.
license_name	str	The type of license of the repository.
install_config	str	Installation configuration for setting up the repository.
requirements	str	Frozen requirements for the repository.
environment	str	Environment configuration for the repository.
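
Because the table lists FAIL_TO_PASS, PASS_TO_PASS, and meta as strings holding JSON, they usually need decoding before use. The following is a minimal sketch, assuming the fields are stored exactly as described in the table. The helper name decode_instance, the example row, and the "is_lite" key inside meta are hypothetical and only serve the illustration.

```python
import json

# Fields that the schema table describes as JSON stored in a string.
JSON_FIELDS = ("FAIL_TO_PASS", "PASS_TO_PASS", "meta")


def decode_instance(row):
    """Return a copy of `row` with the JSON-encoded string fields parsed.

    Values that are not strings (for example, already decoded) are left as-is.
    """
    decoded = dict(row)
    for field in JSON_FIELDS:
        value = decoded.get(field)
        if isinstance(value, str):
            decoded[field] = json.loads(value)
    return decoded


# Hypothetical row shaped like the schema; all values are made up.
example_row = {
    "instance_id": "example_owner__example_repo-123",
    "repo": "example_owner/example_repo",
    "FAIL_TO_PASS": '["tests/test_core.py::test_fix"]',
    "PASS_TO_PASS": '["tests/test_core.py::test_existing", "tests/test_io.py::test_read"]',
    "meta": '{"is_lite": true}',  # key name is an assumption
}

decoded = decode_instance(example_row)
print(decoded["FAIL_TO_PASS"])
print(len(decoded["PASS_TO_PASS"]))
```

The same helper could be applied to rows of the loaded dataset, for instance through a loop over a split, without modifying the original records.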

To execute tasks from SWE-rebench (i.e., set up their environments, apply patches, and run tests), we provide a fork of the original SWE-bench execution framework, adapted for our dataset's structure and features.
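
The fork's own interface is not shown here, so the sketch below does not use it. It only illustrates the pass criterion commonly applied in SWE-bench-style evaluation, under the assumption that the same criterion holds for SWE-rebench: after the candidate patch and the test patch are applied, every test in FAIL_TO_PASS and every test in PASS_TO_PASS must pass. The function name is_resolved and the test identifiers are hypothetical.

```python
def is_resolved(fail_to_pass, pass_to_pass, passed_tests):
    """Check a SWE-bench-style success criterion for one task instance.

    fail_to_pass: tests expected to go from failing to passing.
    pass_to_pass: tests expected to keep passing.
    passed_tests: tests observed to pass after applying the candidate patch.
    """
    passed = set(passed_tests)
    fixed = all(test in passed for test in fail_to_pass)
    not_broken = all(test in passed for test in pass_to_pass)
    return fixed and not_broken


# Hypothetical test identifiers for illustration only.
f2p = ["tests/test_core.py::test_fix"]
p2p = ["tests/test_core.py::test_existing"]

# The patch fixes the target test and keeps the existing one green.
good_run = is_resolved(f2p, p2p, f2p + p2p)

# The patch fixes the target test but breaks an existing one.
regressed_run = is_resolved(f2p, p2p, f2p)

print(good_run, regressed_run)
```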

The dataset is licensed under the Creative Commons Attribution 4.0 license. However, please respect the license of each specific repository on which a particular instance is based. To facilitate this, the license of each repository at the time of the commit is provided for every instance.
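
Since each instance carries a license_name field, instances can be grouped or filtered by license before use. This is a minimal sketch on hypothetical rows. The license strings shown ("MIT", "Apache-2.0", "GPL-3.0") are placeholders, and the actual values in the dataset may be spelled differently, so it is worth inspecting the counts first.

```python
from collections import Counter


def license_counts(rows):
    """Count instances per license_name, treating missing values as 'unknown'."""
    return Counter(row.get("license_name") or "unknown" for row in rows)


def filter_by_license(rows, allowed):
    """Keep only rows whose license_name is in the `allowed` collection."""
    allowed_set = set(allowed)
    return [row for row in rows if row.get("license_name") in allowed_set]


# Hypothetical rows; instance IDs and license strings are placeholders.
rows = [
    {"instance_id": "owner_a__repo_a-1", "license_name": "MIT"},
    {"instance_id": "owner_b__repo_b-2", "license_name": "Apache-2.0"},
    {"instance_id": "owner_c__repo_c-3", "license_name": "GPL-3.0"},
    {"instance_id": "owner_d__repo_d-4", "license_name": None},
]

counts = license_counts(rows)
permissive = filter_by_license(rows, {"MIT", "Apache-2.0"})

print(counts)
print([row["instance_id"] for row in permissive])
```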

@misc{badertdinov2025swerebenchautomatedpipelinetask,
  title={SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents},
  author={Ibragim Badertdinov and Alexander Golubev and Maksim Nekrashevich and Anton Shevtsov and Simon Karasik and Andrei Andriushchenko and Maria Trofimova and Daria Litvintseva and Boris Yangel},
  year={2025},
  eprint={2505.20411},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2505.20411}
}

