‘A complete accuracy collapse’: Apple throws cold water on the potential of AI reasoning – and it's a huge blow for the likes of OpenAI, Google, and Anthropic

Presented with complex logic puzzles, AI reasoning models simply gave up

Six popular AI app logos pictured on a smartphone screen, including DeepSeek, Copilot, Perplexity, OpenAI's ChatGPT, Anthropic Claude, and Google Gemini, three of which were highlighted in a recent Apple AI reasoning research paper disputing their effectiveness.
(Image credit: Getty Images)

Apple has suggested that AI reasoning models have clear limits when it comes to solving complex problems, undermining developers' arguments that they are suited to tasks that would traditionally require human problem-solving.

Reasoning models can solve more complex problems than standard large language models (LLMs) by breaking them down into a series of smaller problems which are solved one by one.

A host of major providers, including OpenAI, Anthropic, and Google, have highlighted the benefits of reasoning models over the last year, touting them as a key weapon in the enterprise AI arsenal.

Apple's paper, titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, specifically names OpenAI's o1 and o3 models, DeepSeek R1, Anthropic's Claude 3.7 Sonnet, and the latest version of Google's Gemini.

It found that while these models handled low-complexity tasks efficiently, using relatively few tokens to solve a given problem, beyond a certain level of complexity they began to take far longer to respond, waste tokens, and return incorrect answers.

Apple’s research team stated that the benchmarks currently used to evaluate what it calls large reasoning models (LRMs) are flawed, as they tend to center around how well a given model can solve coding and mathematical problems.

The researchers argued the results of these benchmarks are susceptible to data contamination – in which test answers are accidentally included in a model's training data – and don't give researchers proper control over the variables.

In response, they devised new puzzles for the LRMs with an emphasis on logical reasoning and no requirement for the model to use outside knowledge to reach an answer. The results showed that beyond a certain level of complexity, the models demonstrated total failure.

“Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” the researchers wrote.

One of the puzzles the researchers used was the Tower of Hanoi, in which disks are stacked in order of size atop the first of three rods. The player must move the 'tower' of disks from the first rod to the third, moving only one disk per turn and never placing a larger disk atop a smaller one.
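For context, the Tower of Hanoi has a compact, well-known recursive solution, and the minimum number of moves grows exponentially: 2^n - 1 for n disks, which is what makes larger instances a useful stress test. The Python sketch below is the textbook algorithm, shown for illustration only; it is not the exact procedure Apple's researchers supplied in their prompts.

def hanoi(n, source, target, spare, moves):
    # Move n disks from source to target using spare as the auxiliary rod,
    # appending each (from_rod, to_rod) move to the moves list.
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks out of the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks on top of it

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves for three disks; ten disks would need 1,023

Because the minimal solution length doubles with every extra disk, each additional disk roughly doubles the length of the reasoning trace a model has to produce without error.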

With just three disks at the start, the models were able to solve the problem and were graded highly for each step they took toward the solution. But beyond a set number of disks, the accuracy of all models fell to zero.

The researchers noted this was the case even when the solution algorithm was provided in the prompt, implying the limitation is inherent to reasoning models. On less complex problems, models were also found to 'overthink' – finding the correct solution early on, then wasting time and tokens considering other, incorrect paths to the solution.

Overall, the paper suggests that LRMs are far less generalizable than providers have claimed, with a tendency to reduce their reasoning effort as problems grow harder rather than increasing it to power through to a solution.

The work was carried out by Apple researchers Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. Between them, the team includes experts in efficient LLM usage, AI and machine learning, as well as an intern focused on deep learning and reinforcement learning.

A huge blow for the ‘reasoning’ crowd

Apple’s research paper seriously rains on the parade of the world’s most prominent AI developers, most of whom have spent the past nine months shouting from the rooftops about the potential for reasoning models.

Exactly what 'reasoning' constitutes has always been hazy, to say the least. In its announcement post for o1-preview, OpenAI claimed it had developed a new family of models that could "spend more time thinking through problems before they respond, much like a person would" and even "refine their thinking process" based on mistakes.

Similarly, Google has hyped up how Gemini 2.5's reasoning helps it handle more complicated problems, while Anthropic claimed that Claude 3.7 Sonnet was specifically designed with "real-world tasks that better reflect how businesses actually use LLMs" in mind, rather than math and computer science problems.

This research paper throws those claims into serious doubt, particularly through its systematic approach to showing that LRMs become less accurate, not more, when presented with harder tasks.

Earlier LLMs have tended to prioritize information given at the beginning and end of long prompts and gloss over the context provided in the middle. This has meant they've historically struggled with so-called 'needle in a haystack' problems, in which they are tested on their ability to recall information buried somewhere in the middle of a prompt in excess of 128,000 tokens.

What we're seeing here could be the reasoning model equivalent: a fatal flaw at the heart of the algorithms powering LRMs that causes them to lose 'focus' on a problem and inevitably fail when left to work on it for too long.

It’s worth noting that the researchers allowed the models to use their full ‘budget’ for thinking – for example, 64,000 tokens for Claude 3.7 Sonnet – but that beyond a certain level of complexity, the models simply stopped spending tokens on reasoning through the problem further.

No one wants AI models that give up if they decide a problem is too hard, not least if they’ve been promised these are the ones built for the hardest problems.

If the limitations are indeed inherent to current training approaches, developers face the headache of quickly finding new techniques and architectures that solve this issue.

For now, customers may start to question if the ‘thinking’ models they’ve been sold are truly up to the task.

Rory Bathgate
Features and Multimedia Editor

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.

In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.