‘A complete accuracy collapse’: Apple throws cold water on the potential of AI reasoning – and it's a huge blow for the likes of OpenAI, Google, and Anthropic
Presented with complex logic puzzles, AI reasoning models simply gave up
Apple has suggested that AI reasoning models hit clear limits when solving complex problems, undermining developers' claims that these models are suited to tasks a human would traditionally handle.
Reasoning models can solve more complex problems than standard large language models (LLMs) by breaking them down into a series of smaller problems which are solved one by one.
A host of major providers, including OpenAI, Anthropic, and Google have highlighted the benefits of reasoning models over the last year, touting these as a key weapon in the enterprise AI arsenal.
The findings come in a new research paper, titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, which specifically names OpenAI’s o1 and o3 models, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet, and the latest version of Google’s Gemini.
It found that while these models efficiently handled low complexity tasks, using relatively few tokens to solve the given problem, beyond a certain level of complexity they began to take much longer to respond, waste tokens, and return incorrect answers.
Apple’s research team stated that the benchmarks currently used to evaluate what it calls large reasoning models (LRMs) are flawed, as they tend to center around how well a given model can solve coding and mathematical problems.
The researchers argued the results of these benchmarks are susceptible to data contamination – in which the answers are accidentally incorporated into a model’s training phase – and don’t give researchers proper control over the variables.
In response, they devised new puzzles for the LRMs with an emphasis on logical reasoning and no requirement for the model to use outside knowledge to reach an answer. The results showed that beyond a certain level of complexity, the models demonstrated total failure.
“Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” the researchers wrote.
One of the puzzles researchers used was the Tower of Hanoi, in which disks are stacked in order of size atop a rod next to two others. The player must move the ‘tower’ of disks from the first rod to the third by moving only one disk per turn, and without a larger disk ever being set atop one smaller than itself.
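The puzzle has a well-known recursive solution — an n-disk tower takes exactly 2^n − 1 moves — which is part of what makes the models' failure at larger sizes so striking. As a rough illustration (not the exact setup or prompt the Apple researchers used), the standard algorithm can be sketched in a few lines of Python:

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the optimal move list for an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # move n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # rebuild the smaller tower on top
    return moves

# A 3-disk tower needs 2**3 - 1 = 7 moves
print(len(hanoi(3, "A", "C", "B")))  # 7
```

The move count doubles with each added disk, so the problem scales up cleanly — exactly the property the researchers exploited to dial complexity higher and higher.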
With just three disks at the start, the models were able to solve the problem and were graded highly for each step they took toward the solution. But beyond a set number of disks, the accuracy of all models fell to zero.
Researchers noted this was the case even when the solution algorithm was provided in the prompt, implying the limitation is inherent to reasoning models. Even with less complex problems, models were also found to ‘overthink’ – finding the solution early on, then wasting time and tokens considering other, incorrect paths to the solution.
Overall, the paper suggests that LRMs are far less generalizable than providers have suggested, tending to reduce their computation as problems get harder rather than scaling it up to power through to a solution.
The work was carried out by Apple researchers Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. Between them, the team includes experts in efficient LLM usage, AI and machine learning, as well as an intern focused on deep learning and reinforcement learning.
A huge blow for the ‘reasoning’ crowd
Apple’s research paper seriously rains on the parade of the world’s most prominent AI developers, most of whom have spent the past nine months shouting from the rooftops about the potential for reasoning models.
The exact claims about what ‘reasoning’ constitutes have always been hazy, to say the least. In its announcement post for o1-preview, OpenAI claimed it had developed a new family of models that could “spend more time thinking through problems before they respond, much like a person would” and even “refine their thinking process” based on mistakes.
Similarly, Google has hyped up how Gemini 2.5’s reasoning helps it handle more complicated problems, while Anthropic claimed that Claude 3.7 Sonnet was optimized less for math and computer science problems and more for “real-world tasks that better reflect how businesses actually use LLMs”.
This research paper throws these claims into serious doubt, particularly in its systematic approach to showing that LRMs get less accurate, not more, when presented with harder tasks.
Earlier LLMs have tended to prioritize information given at the beginning and end of large prompts and brush over the context provided in the middle. This has meant they’ve historically struggled with so-called ‘needle in a haystack’ problems, in which they are tested on their ability to respond to information buried somewhere in the middle of a prompt in excess of 128,000 tokens.
What we’re seeing here could be the reasoning model equivalent: a fatal flaw at the heart of the algorithms powering LRMs that causes them to lose ‘focus’ on a problem and inevitably fail when allowed to work on it for too long.
It’s worth noting that the researchers allowed the models to use their full ‘budget’ for thinking – for example, 64,000 tokens for Claude 3.7 Sonnet – but that beyond a certain level of complexity, the models simply stopped spending tokens on reasoning through the problem further.
No one wants AI models that give up if they decide a problem is too hard, not least if they’ve been promised these are the ones built for the hardest problems.
If the limitations are indeed inherent to our current training approaches, developers are facing the headache of quickly implementing new approaches and architectures that solve this issue.
For now, customers may start to question if the ‘thinking’ models they’ve been sold are truly up to the task.
Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.