Is Sonnet 4.5 the best coding model in the world?

According to our internal agentic coding benchmark: yes!

Claude Sonnet 4.5 is the clear leader, but GPT-5-Codex is more than 2x cheaper.

While the scores of Claude Sonnet 4.5 and GPT-5-Codex look similar, the models aren’t. Roughly half of each model’s failed tasks were passed by the other model, reflecting differing skill sets and reasoning styles.

That's why it's important not only to look at the raw benchmark scores but also to analyze these differences in order to understand the true nuances of model performance in the real world. This is what we'll unpack in this blog.

How we built our internal benchmark

First, let's dive into how we created this benchmark dataset containing 2161 tasks.

Each task includes a prompt, codebase, reference solution, and unit tests which verify correctness of the agent’s results. All were designed to challenge the most advanced models, then reviewed and verified to ensure they were achievable and accurately covered by the tests.
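For concreteness, here is a rough sketch in Python of what a single task record might look like. The class and field names below are hypothetical illustrations, not our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    """One benchmark task: everything the agent needs, plus the ground truth used to grade it."""
    task_id: str
    language: str                                        # one of the nine covered languages
    prompt: str                                          # the instruction given to the agent
    repo_path: str                                       # snapshot of the codebase the agent works in
    reference_solution: str                              # expert-written patch, never shown to the agent
    unit_tests: list[str] = field(default_factory=list)  # tests that verify the agent's final result

def grade(task: BenchmarkTask, run_test) -> bool:
    """A task counts as passed only if every verification test passes against the agent's final repo."""
    return all(run_test(task.repo_path, test) for test in task.unit_tests)
```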

We felt that current coding benchmarks were lacking in scale and diversity, contamination control, and real-world application.

Therefore, our goal was to collect a dataset that addressed these gaps: tasks across nine languages, built from a mix of public and private codebases, and spanning the full range of what agentic models encounter in practice, from tidy, well-defined prompts to messy, open-ended ones across projects of varying scale.

Task creation guidelines

Below is a snippet of the instructions given to Surgers. The emphasis was on creating realistic tasks that could be reliably evaluated through unit tests.

Example instruction snippet:

Who were the Surgers that created the tasks?

We screened our workforce with a series of tests. Each candidate was assessed on three critical dimensions:

  1. Engineering mastery - depth of technical understanding and precision in reasoning.
  2. Adversarial creativity - the ability to push models to their limits and expose weaknesses.
  3. Instructional discipline - meticulous attention to following complex, detailed directions.

From thousands of applicants, we assembled a small, world-class group of engineers. Some profiles:

Marcus
I’ve spent the past eight years building and reviewing large-scale software systems, including five years at Google. There, I served as a Readability Reviewer, one of the company’s highest engineering designations, responsible for approving production commits across multiple repositories for both functional integrity and idiomatic quality. My work focused on ensuring that code met not just performance and security standards, but the craftsmanship expected of world-class engineering teams. I specialize in full-stack web development, large-system architecture, and the kind of rigorous code review that keeps billion-user systems clean, fast, and maintainable.

Ethan
I’m a software engineer and quantitative researcher with a background spanning low-latency systems, data infrastructure, and machine learning. My experience includes building performance-critical C/C++ modules at BlackBerry, refactoring Django–PostgreSQL pipelines at TD Asset Management to achieve over 250% speedups, and implementing predictive models at Berkeley Street Capital. I’m passionate about squeezing every drop of performance out of complex systems, whether that means optimizing a database query plan, designing efficient trading algorithms, or debugging distributed pipelines under production load.

Adrian
I started programming in the early 1980s, writing animation and visualization software in assembly and Modula-2, and eventually founded a startup that developed character animation tools for the Amiga platform. Over the decades, I’ve worked in nearly every layer of software development, from low-level systems to large-scale enterprise applications, mastering languages like BASIC, C/C++, and Java along the way. At Bank of America, I served as a senior developer and systems architect, designing and maintaining intranet applications built on Java/J2EE/Spring/Hibernate, while mentoring teams of junior engineers. After more than forty years in the field, I still approach every engineering challenge with the same curiosity and rigor that first drew me to programming.

Now let's take a look at one in-depth example that demonstrates the differences in reasoning and results by Claude Sonnet 4.5 and GPT-5-Codex.

Case study: Refactoring a matrix tool

The setup

This example was drawn from the same run as the results above. The run followed a test-driven development setup, where models could run the tests themselves before submitting their solutions. All models were integrated into a custom-built, bash-only agent to minimize the effects of complex agent setups.
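While the harness itself isn't shown in this post, the idea of a bash-only agent is straightforward: the model's only tool is a shell, and it loops between reasoning and running commands until it submits. Here is a minimal sketch, assuming a hypothetical model.next_action() interface (the real agent differs in its details):

```python
import subprocess

def run_agent(model, prompt: str, max_steps: int = 100) -> None:
    """Minimal bash-only agent loop: the model sees the task and all prior tool output,
    then replies with either a bash command to run or a final submit action."""
    history = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        action = model.next_action(history)        # hypothetical model interface
        if action.kind == "submit":                # the model declares it is finished
            break
        result = subprocess.run(
            ["bash", "-lc", action.command],       # the only tool available: a bash shell
            capture_output=True, text=True, timeout=300,
        )
        history.append({
            "role": "tool",
            "content": (result.stdout + result.stderr)[-8000:],   # truncate long output
        })
```

Because the model can run the project's tests from that shell at any point, the test-driven setup falls out naturally: it can check its work before submitting.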

The repo

This task came from a small personal Python codebase for a command-line tool that performs basic linear algebra operations. It lets users create matrices, perform row operations, and display results neatly in the terminal.

The task

The goal wasn’t to add new functionality but to refactor the existing code: specifically, to introduce a new class structure that more clearly distinguished the headers (c1 c2 c3) from the matrix data. Pass-to-pass (p2p) tests were regression tests ensuring that existing matrix operations and terminal formatting remained unchanged, while fail-to-pass (f2p) tests were initially failing tests designed to confirm that the new class structure was implemented correctly.
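The repository is private, so the snippet below is only an illustration of the shape of such a refactor and its tests; the Headers and Matrix names and the test bodies are hypothetical, not the task's actual code:

```python
class Headers:
    """Column labels such as c1, c2, c3, kept separate from the numeric data."""
    def __init__(self, labels: list[str]):
        self.labels = labels

class Matrix:
    """Numeric matrix data plus a Headers object, rather than one blob holding both."""
    def __init__(self, rows: list[list[float]], headers: Headers):
        self.rows = rows
        self.headers = headers

# An f2p-style test fails before the refactor and passes once the new structure exists:
def test_matrix_exposes_headers():
    m = Matrix([[1, 2], [3, 4]], Headers(["c1", "c2"]))
    assert m.headers.labels == ["c1", "c2"]

# A p2p-style regression test keeps checking behavior that already worked:
def test_row_count_unchanged():
    m = Matrix([[1, 2], [3, 4]], Headers(["c1", "c2"]))
    assert len(m.rows) == 2
```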

The result

Claude Sonnet 4.5 passed this task; GPT-5-Codex did not. But that’s only part of the story. Expert analysis reveals where and how they each struggled, and what’s needed for improvement.

Claude Sonnet 4.5

While Claude succeeded on this task, it wasn’t without hiccups. It quickly gathered the necessary context to interpret the prompt and understand the codebase. After twenty steps (tool calls and reasoning steps), Claude had refactored the class structure and passed all the fail-to-pass (f2p) tests.

However, it broke some existing functionality, causing two of the pass-to-pass (p2p) tests to fail. The issue was with aligning the column headers (c1, c2, c3) when printing the matrix to the terminal.

Claude became very confused by this issue, struggling to understand the relationship between padding, column width, and the number of characters in each matrix element. Almost all of its effort on the task ended up focused on debugging this single problem.
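The formatting logic at the heart of the problem is mundane: each column's header and entries have to be padded to a shared width. Here is a short illustration of the kind of code involved, written for this post rather than taken from the repo:

```python
def format_matrix(headers: list[str], rows: list[list[float]]) -> str:
    """Right-align each column so headers like c1, c2, c3 line up with the values beneath them."""
    cells = [headers] + [[str(x) for x in row] for row in rows]
    widths = [max(len(c) for c in col) for col in zip(*cells)]   # widest cell per column
    return "\n".join("  ".join(c.rjust(w) for c, w in zip(row, widths)) for row in cells)

print(format_matrix(["c1", "c2", "c3"], [[1, 20, 3], [400, 5, 60]]))
#  c1  c2  c3
#   1  20   3
# 400   5  60
```

A small error in the width calculation is enough to change the terminal output and break formatting regression tests like the p2p tests described above.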

GPT-5-Codex

GPT-5-Codex failed this task. Early on, it missed key context by failing to read the f2p tests, which clarified the intended class structure. As a result, it misinterpreted the prompt and reversed the class relationships. Even when implementing its flawed design, it often failed to follow through on its plans, stating it would make certain changes, then not doing so. 

However, through testing and debugging, GPT-5-Codex was also able to correct its course. After sixty-eight steps, it had corrected many of its earlier mistakes and seemed to be on track to a successful solution. Then, inexplicably, it ended the task early.

Despite knowing the solution was unfinished, and even outlining the next steps, GPT-5-Codex ended its run anyway. It likely would have succeeded eventually if not for this behavior.

See the Appendix for the full deep-dive analysis of this task.

Overall

Both models encountered difficulties in their workflows: Claude Sonnet 4.5 with preserving existing functionality, and GPT-5-Codex with misunderstanding the requirements. But an odd behavioral quirk of GPT-5-Codex is what ultimately made the difference.

What’s just as striking, though, is what neither model did. Neither hallucinated nor strayed from the context provided. And while both struggled at points, they stayed focused and recovered rather than spiraling off track. Many technical difficulties can be overcome so long as the model maintains focus. It’s this steadiness that now separates these models from much of the field.

The new gold standard in coding AI

So bringing it full circle, Claude Sonnet 4.5 outperformed every other coding model we tested on the tasks that matter most: understanding context, refactoring complex systems, and reasoning through code like a human engineer.

Between GPT-5-Codex and Claude Sonnet 4.5, roughly half of the tasks failed by each model were passed by the other, revealing that these systems don’t just differ in skill level; they differ in thinking style. Sonnet 4.5 shows stronger structured reasoning and context integration, while GPT-5-Codex demonstrates more aggressive exploration and recovery behaviors.

More strikingly: GPT-5-Codex achieves this near-parity at less than half the cost of Claude Sonnet 4.5.

The diversity of reasoning across models points to an exciting future: one where we don’t just measure how well models code, but how differently they reason.

These results are only the start. We’ll keep building on them in future runs, pairing hard numbers with the insight of our expert evaluators.
