tl;dr: The results of our benchmark indicate Macroscope achieved the highest detection rate of bugs from our dataset (48%). CodeRabbit and Cursor Bugbot were close behind (46% and 42% respectively), followed by Greptile (24%) and Graphite Diamond (18%).
To benchmark the effectiveness of Macroscope’s code review, we constructed a dataset of real-world production bugs drawn from open-source repositories. We then evaluated Macroscope and other AI code review tools (Graphite Diamond, Cursor Bugbot, Greptile and CodeRabbit) against that dataset. The results of our benchmark showed that Macroscope achieved the highest bug detection rate from our dataset of known bugs.
Dataset Methodology
We assembled our dataset with the following process:
- We selected 45 popular open-source repositories1 across the 8 programming languages Macroscope supports (Go, Java, JavaScript, Kotlin, Python, Rust, Swift, TypeScript). We searched their commit logs to find commits labelled as bug fixes.
- An LLM classified each commit as either a self-contained issue2 or a context-dependent issue, and either a runtime issue3 or a subjective code-quality issue (e.g. adding documentation, styling, linting). We kept only the self-contained runtime bugs, since Macroscope Code Review is designed specifically to detect this type of issue. Given that the other tools describe themselves as being capable of detecting self-contained runtime bugs, focusing on this type of issue allowed for an apples-to-apples evaluation.
- For each bug-fix commit, an LLM analyzed the original commit message and patch to generate a description of the bug that was fixed.
- We used git blame data to attempt to identify the commit that introduced the bug, and used an LLM to try to verify whether the commit identified was indeed the source (see the sketch after this list).
- We manually reviewed a subset of the dataset and removed any samples where the dataset appeared to be incorrect.
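As a rough illustration of the git blame step in the list above, the Python sketch below blames the lines a fix commit removed at that commit's parent and returns the most frequently blamed commits as candidates for having introduced the bug. It is a simplification under assumed conventions (plain git via subprocess, no handling of renames, root commits, or quoted paths), not our actual pipeline.

```python
import subprocess
from collections import Counter

def candidate_introducing_commits(repo_dir: str, fix_commit: str) -> list[str]:
    """Blame the lines removed by a fix commit at that commit's parent and return
    candidate bug-introducing commits, most frequently blamed first.
    Simplified sketch; the real pipeline handles more edge cases than this."""
    # The lines the fix removed are the ones most likely to have contained the bug.
    diff = subprocess.run(
        ["git", "-C", repo_dir, "diff", "-U0", f"{fix_commit}^", fix_commit],
        capture_output=True, text=True, check=True,
    ).stdout

    candidates = Counter()
    current_file = None
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            current_file = line[len("--- a/"):]
        elif line.startswith("@@") and current_file:
            # Hunk header looks like "@@ -start,count +start,count @@" (count omitted when 1).
            old_range = line.split()[1].lstrip("-")
            start, _, count = old_range.partition(",")
            lines_removed = int(count or "1")
            if lines_removed == 0:
                continue  # pure insertion: nothing to blame on the pre-fix side
            # Blame the removed lines in the pre-fix version of the file.
            blame = subprocess.run(
                ["git", "-C", repo_dir, "blame", "--porcelain",
                 "-L", f"{start},+{lines_removed}", f"{fix_commit}^", "--", current_file],
                capture_output=True, text=True, check=True,
            ).stdout
            for blame_line in blame.splitlines():
                sha = blame_line.split(" ", 1)[0]
                # Porcelain entries begin each blamed line with a full 40-char commit hash.
                if len(sha) == 40 and all(c in "0123456789abcdef" for c in sha):
                    candidates[sha] += 1

    return [sha for sha, _ in candidates.most_common()]
```

An LLM then checked the leading candidate against the fix to verify that it plausibly introduced the bug.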
This process ultimately yielded a dataset of 118 runtime bugs across the 45 repositories we sampled. Each sample in our dataset contains the commit hash that we believe introduced the bug, the commit hash that fixed the bug, the patch that fixed the bug, and an LLM-generated description of the bug. Here’s a sample of one bug from the dataset:
We identified this bug-fix commit:
- Commit message: fixed overflow error in gdc computation
- Commit Hash: dabf3a5beb9ab697d570154b9961078a8586c787
- Repository: https://github.com/apache/commons-math
We identified this commit as the commit that introduced the bug:
- Commit Hash: 746892442f75845426e16f258d42498ad1de154b
An LLM generated this description of the bug that was fixed:
Description: MathUtils.gcd incorrectly detects zero operands by testing u * v == 0, which can overflow to zero for non-zero inputs (e.g., 65536 × 65536), causing the method to take the zero-case path and return |u| + |v| instead of the correct GCD.
Dataset Bugs per Language
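Conceptually, each record in the dataset boils down to a handful of fields like the ones in the sample above. A minimal sketch of that schema is below; the field names are illustrative rather than our internal ones.

```python
from dataclasses import dataclass

@dataclass
class BugSample:
    """One entry in the benchmark dataset (illustrative field names)."""
    repository: str          # e.g. "https://github.com/apache/commons-math"
    introducing_commit: str  # hash of the commit we believe introduced the bug
    fixing_commit: str       # hash of the commit that fixed the bug
    fix_patch: str           # the fixing patch, as unified diff text
    description: str         # LLM-generated description of the bug that was fixed
    language: str            # one of the 8 languages in the dataset
```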
Evaluation Methodology
Once we assembled the dataset of known bugs, we evaluated whether different code review tools detected those bugs. For each bug in our dataset:
- We forked its repository and created two branches: one at the parent of the commit that introduced the bug (to simulate the state of the codebase before the bug was introduced) and another branch at the bug commit itself (to simulate the state of the codebase when the bug was introduced).
- We opened a pull request with the base set to the parent commit branch and the head set to the branch with the bug commit. We did this so that the PR diff would be identical to the commit we believed introduced the bug (see the sketch after this list).
- Each tool ran with its default settings and in isolation on its own GitHub PR, to avoid scenarios where one tool might skip reporting a correctly identified bug because another tool had already flagged it. Whenever a PR review failed or was rate-limited, we attempted to re-run it.
- After the reviews were complete, an LLM evaluated whether any review comments from each tool matched the bug descriptions generated during dataset assembly (sketched below).
- We manually verified all of the matched bugs identified for each tool. We also randomly sampled and verified five non-matches per tool to ensure that the LLM was not misclassifying cases in either direction. Full manual QA of all results was impractical due to the sheer number of aggregate review comments left across all tools, so sampling was used instead.
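Below is a rough sketch of the branch and PR setup described in the first two steps of this list, using plain git plus the GitHub CLI (gh). The function name, branch names, and PR title/body are illustrative, and the real harness handles forking, authentication, and failure cases that are omitted here.

```python
import subprocess

def open_evaluation_pr(repo_dir: str, bug_commit: str,
                       fork_remote: str, fork_repo: str) -> None:
    """Create a base branch at the parent of the bug-introducing commit and a head
    branch at the bug commit itself, then open a PR whose diff is exactly that commit.
    Simplified sketch; names and the PR text are illustrative."""
    base_branch = f"eval-base-{bug_commit[:8]}"
    head_branch = f"eval-head-{bug_commit[:8]}"

    def git(*args: str) -> None:
        subprocess.run(["git", "-C", repo_dir, *args], check=True)

    # Base branch: codebase state just before the bug was introduced.
    git("branch", base_branch, f"{bug_commit}^")
    # Head branch: codebase state as of the bug's introduction.
    git("branch", head_branch, bug_commit)
    git("push", fork_remote, base_branch, head_branch)

    # Open the PR on the fork; each tool then reviews exactly the bug-introducing diff.
    subprocess.run(
        ["gh", "pr", "create",
         "--repo", fork_repo,  # e.g. "our-org/commons-math" (the fork)
         "--base", base_branch,
         "--head", head_branch,
         "--title", f"Evaluation PR for {bug_commit[:8]}",
         "--body", "Benchmark evaluation PR: diff equals the bug-introducing commit."],
        check=True,
    )
```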
Due to issues with rate limits and availability, we were not able to have a consistent sample size of bugs across all tools (see this table for the number of bugs evaluated per tool). Most notably, our access to Greptile’s code review functionality was disabled midway through our evaluation. We did not include any results from Pull Requests where the tool did not explicitly indicate within the GitHub PR that the review was completed successfully.
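The comment-matching step above reduces to a yes/no judgment per (bug, tool) pair. In the sketch below, `judge` stands in for whatever function sends a prompt to the LLM and returns its reply; the prompt wording is illustrative, not the exact prompt we used.

```python
from typing import Callable, List

def review_comments_match_bug(bug_description: str,
                              review_comments: List[str],
                              judge: Callable[[str], str]) -> bool:
    """Return True if the judge model says any review comment identifies the known bug.
    `judge` wraps whichever LLM API is used; the prompt below is illustrative."""
    prompt = (
        "You are evaluating an AI code review tool.\n\n"
        f"Known bug:\n{bug_description}\n\n"
        "Review comments the tool left on the pull request:\n"
        + "\n---\n".join(review_comments)
        + "\n\nDoes any comment above correctly identify this specific bug? Answer YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")
```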
Results
We analyzed our results across the following dimensions:
- Known Bug Detection Rate - The percentage of known bugs from the dataset that each code review tool correctly identified in a review comment4.
- Price - The standard monthly per-seat price of each code review tool.
- Comment Volume - The average number of review comments left per pull request by each tool.
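For concreteness, the detection-rate and comment-volume dimensions reduce to simple ratios over per-PR results. The sketch below assumes one result record per (tool, PR) pair; the record and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PRResult:
    """Outcome of one tool's review of one evaluation PR (illustrative fields)."""
    detected: bool      # did any review comment match the known bug?
    comment_count: int  # total review comments the tool left on the PR

def detection_rate(results: List[PRResult]) -> float:
    """Known bug detection rate: share of evaluated bugs the tool flagged."""
    return sum(r.detected for r in results) / len(results)

def avg_comment_volume(results: List[PRResult]) -> float:
    """Average number of review comments left per evaluated PR."""
    return sum(r.comment_count for r in results) / len(results)
```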
Known Bug Detection Rate
The results of our benchmark indicate Macroscope achieved the highest detection rate of bugs from our dataset (48%). CodeRabbit and Cursor Bugbot were close behind (46% and 42% respectively), followed by Greptile (24%) and Graphite Diamond (18%).
Detection rates varied across languages. Macroscope had the highest bug detection rates for Go (86%), Java (56%), Python (50%), and Swift (36%), while CodeRabbit had the highest bug detection rates for JavaScript (59%) and Rust (45%). For Kotlin, CodeRabbit and Macroscope tied for the highest detection rate (50%), and for TypeScript, CodeRabbit and Cursor Bugbot tied for highest (36%).
Bug Detection Rate by Language
Comment Volume
To evaluate how “loud” or “quiet” each tool is during code reviews, we measured the number of comments left on each pull request we opened. We did not assess the quality, value or correctness of all of these comments. We simply quantified the average number of review comments left per opened PR, so that we could illustrate a normalized volume meter for how “loud” each tool was on its default settings.
Macroscope’s average comment volume ranked in the middle of the five tools we evaluated (all on default settings), while CodeRabbit was by far the “loudest” and Graphite Diamond was the “quietest”.
We then narrowed our analysis to focus specifically on review comments that identified runtime issues from our dataset. By excluding style and documentation comments, we were able to do a closer like-for-like comparison of the volume of runtime-related comments across tools. Since each tool categorizes issues differently, we aligned categories5 as best as we could.
Within this subset of review comments that flagged runtime issues, CodeRabbit still generated the highest average volume of comments (i.e. the “loudest”), while Graphite Diamond generated the least (i.e. the “quietest”). Macroscope again fell between these extremes, generating fewer average runtime-related comments than CodeRabbit, but more than Greptile, Cursor Bugbot, and Graphite Diamond.
Runtime-Related Comments per PR
Pricing
For this benchmark, we used the minimum plans6 — both in functionality and term (e.g. monthly) — that allowed us to run the evaluation.
Comparing pricing across code review tools is challenging because the features and functionality offered by each tool aren’t directly comparable. Some tools (e.g. Greptile, CodeRabbit) offer customizable rules and configuration options to detect issues beyond runtime issues. In Macroscope’s case, we provide a suite of features beyond AI code review that deliver value for leaders and product stakeholders.
Price vs. Bug Detection Rate
Evaluation Results
Evaluation Results Summary Table
We plan to continue running internal benchmarking to make sure that we’re delivering the best code review experience for our customers.
Limitations
As with any benchmark evaluation, this evaluation has limitations to consider when interpreting the results:
- We only evaluated detection of self-contained runtime bugs, and did not assess detection of other types of issues such as style issues or compilation issues.
- All tools ran with their default settings (e.g. we did not utilize custom rules, which some tools support). For each tool, we used their minimum plan (i.e. in terms of functionality and term) that enabled us to run our benchmark evaluation.
- Due to issues with rate limits and availability, we were not able to have a consistent sample size of bugs across all tools (see this table for the number of bugs evaluated per tool). Most notably, our access to Greptile’s code review functionality was disabled midway through our evaluation. We excluded from our evaluation any Pull Requests where the code review tool did not explicitly indicate within the GitHub PR that the review was completed successfully.
- It is possible that we were invisibly rate-limited by some of these tools, which could have impacted results. To mitigate this risk, for each PR that we included in our result set, we confirmed that each tool indicated a successful review (e.g. a Check Run shown, or a generated footer quantifying a non-zero number of files reviewed).
- Macroscope has a hybrid pricing model to accommodate unusual circumstances where a customer’s codebase activity exceeds typical volumes. In these cases, rather than rate-limit our customers, we allow “usage-based” overages which are charged per commit processed and per PR review. The quantity of usage simulated by this benchmark evaluation is considerably below the threshold that would have triggered usage-based overages on Macroscope’s lowest-tier monthly plan.
- We have used the results of this benchmarking data to fix bugs and improve our code review pipeline (e.g. fix bugs in our AST walkers), however we have not encoded any of the bugs in our dataset into our pipelines, nor trained any models on this data. We acknowledge that the other tools we evaluated did not have the same opportunity to fix any issues that they would have encountered with this exact dataset.
- Due to the scale of our dataset, we did not manually validate every negative result to ensure that the LLM classification was correct, so it’s possible that some bugs marked as not detected (for every tool, including Macroscope) were misclassified. We did manually verify a random subset of five non-detections per tool.
- Due to the approach we took generating our dataset, the distribution of bugs across languages is uneven (see Chart 1). This unevenness did not skew the overall results in our favor relative to other tools: when examining unweighted averages (i.e. averaging each tool’s per-language bug detection rates), we observed that our relative overall ranking remained unchanged (see the sketch after this list).
- This evaluation was conducted between August 25 2025 and September 14 2025. Because each tool is constantly changing, it’s possible that the results would be different if the same benchmark were rerun today - including for Macroscope.
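To make the weighted-versus-unweighted comparison in the limitations above concrete, the sketch below shows both calculations over per-language counts. Here `per_language` maps a language name to a (bugs detected, bugs evaluated) pair; the names are illustrative and no benchmark numbers are embedded.

```python
from typing import Dict, Tuple

def overall_detection_rate(per_language: Dict[str, Tuple[int, int]]) -> float:
    """Weighted (pooled) rate: total detections over total bugs, so languages
    that contributed more samples count for more."""
    detected = sum(d for d, _ in per_language.values())
    total = sum(t for _, t in per_language.values())
    return detected / total

def unweighted_detection_rate(per_language: Dict[str, Tuple[int, int]]) -> float:
    """Unweighted (macro) rate: average of the per-language rates, so every
    language counts equally regardless of how many bugs it contributed."""
    rates = [d / t for d, t in per_language.values()]
    return sum(rates) / len(rates)
```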