AI is bad at math, ORCA shows


In the world of George Orwell's 1984, two and two make five. And large language models are not much better at math.

Though AI models have been trained to emit the correct answer and to recognize that "2 + 2 = 5" might be a reference to the errant equation's use as a Party loyalty test in Orwell's dystopian novel, they still can't calculate reliably.

Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses a series of math-oriented natural language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test.

ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 all scored a failing grade of 63 percent or less.

There are various other benchmarks used to assess the math capabilities of AI models, such as GSM8K and MATH-500. If you were to judge by AI models' scores on many of these tests, you might assume machine learning has learned nearly everything, with some models scoring 95 percent or above.

But benchmarks, as we've noted, are often designed without much scientific rigor. 

The researchers behind the ORCA Benchmark – Claudia Herambourg, Dawid Siuda, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, and Joanna Śmietańska-Nowak – argue that while models like OpenAI's GPT-4 have scored well on tests like GSM8K and MATH, prior research shows LLMs still make errors of logic and arithmetic. According to Oxford University's Our World in Data site, which measures AI models' performance relative to a human baseline score of 0, math reasoning for AI models scores -7.44 (based on April 2024 data).

What's more, the authors say, many of the existing benchmark data sets have been incorporated into model training data, a situation similar to students being given the answers prior to an exam. Thus, they contend, ORCA is needed to evaluate actual computational reasoning as opposed to pattern memorization.

According to their study, distributed via preprint service arXiv and on Omni Calculator's website, ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 "achieved only 45–63 percent accuracy, with errors mainly related to rounding (35 percent) and calculation mistakes (33 percent)."

The evaluation was conducted in October 2025, using 500 math-oriented prompts in various categories: Biology & Chemistry, Engineering & Construction, Finance & Economics, Health & Sports, Math & Conversions, Physics, and Statistics & Probability.

"Gemini 2.5 Flash achieved the highest overall accuracy (63 percent), followed closely by Grok 4 (62.8 percent), with DeepSeek V3.2 ranking third at 52.0 percent," the paper says. 

"ChatGPT-5 and Claude Sonnet 4.5 performed comparably but at lower levels (49.4 percent and 45.2 percent, respectively), indicating that even the most advanced proprietary models still fail on roughly half of all deterministic reasoning tasks. These results confirm that progress in natural-language reasoning does not directly translate into consistent computational reliability."

Claude Sonnet 4.5 had the lowest scores overall – it failed to score better than 65 percent on any of the question categories. And DeepSeek V3.2 was the most uneven, with strong Math & Conversions performance (74.1 percent) but dismal Biology & Chemistry (10.5 percent) and Physics (31.3 percent) scores.

And yet, these scores may represent nothing more than a snapshot in time, as these models often get adjusted or revised. Consider this question from the Engineering & Construction category, as cited in the paper:

Prompt: Consider that you have 7 blue LEDs (3.6V) connected in parallel, together with a resistor, subject to a voltage of 12 V and a current of 5 mA. What is the value of the power dissipation in the resistor (in mW)?

Expected result: 42 mW

Claude Sonnet 4.5: 294 mW

When El Reg put this prompt to Claude Sonnet 4.5, the model said it was uncertain whether the 5 mA figure referred to current per LED (incorrect) or the total current (correct). It offered both the incorrect 294 mW answer and, as an alternative, the correct 42 mW answer.
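The disagreement comes down to how much current flows through the resistor. Since parallel LEDs all sit at the same 3.6 V forward voltage, the resistor drops the remaining 8.4 V; multiply that by either the total 5 mA or by 5 mA per LED and you get the two competing answers. Here is a quick back-of-the-envelope check in Python (the variable names are ours, not the paper's):

```python
# Sanity check of the ORCA example above: the 7 parallel LEDs share one
# 3.6 V forward voltage, so the resistor drops the rest of the 12 V supply.

SUPPLY_V = 12.0       # supply voltage, volts
LED_FORWARD_V = 3.6   # forward voltage of each blue LED, volts
CURRENT_A = 5e-3      # the 5 mA figure from the prompt
NUM_LEDS = 7

resistor_drop_v = SUPPLY_V - LED_FORWARD_V  # 8.4 V across the resistor

# Reading 1 (the paper's expected answer): 5 mA is the total current.
p_total_mw = resistor_drop_v * CURRENT_A * 1e3
print(f"5 mA total:   {p_total_mw:.0f} mW")   # 42 mW

# Reading 2 (Claude's answer): 5 mA flows through each LED, 35 mA total.
p_per_led_mw = resistor_drop_v * CURRENT_A * NUM_LEDS * 1e3
print(f"5 mA per LED: {p_per_led_mw:.0f} mW")  # 294 mW
```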

In short, AI benchmarks don't necessarily add up. But if you want them to, you may find the result is five. ®
