‘A complete accuracy collapse’: Apple throws cold water on the potential of AI reasoning – and it's a huge blow for the likes of OpenAI, Google, and Anthropic
Presented with complex logic puzzles, AI reasoning models simply gave up


Apple has suggested in a new research paper that AI reasoning models have clear limits when it comes to solving complex problems, undermining developer claims that they are well suited to tasks a human would traditionally solve.
Reasoning models can solve more complex problems than standard large language models (LLMs) by breaking them down into a series of smaller problems which are solved one by one.
A host of major providers, including OpenAI, Anthropic, and Google, have highlighted the benefits of reasoning models over the last year, touting them as a key weapon in the enterprise AI arsenal.
The paper, titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, specifically names OpenAI’s o1 and o3 models, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet, and the latest version of Google’s Gemini.
It found that while these models efficiently handled low complexity tasks, using relatively few tokens to solve the given problem, beyond a certain level of complexity they began to take much longer to respond, waste tokens, and return incorrect answers.
Apple’s research team stated that the benchmarks currently used to evaluate what it calls large reasoning models (LRMs) are flawed, as they tend to center around how well a given model can solve coding and mathematical problems.
The researchers argued the results of these benchmarks are susceptible to data contamination – in which the answers are accidentally incorporated into a model’s training phase – and don’t give researchers proper control over the variables.
In response, they devised new puzzles for the LRMs with an emphasis on logical reasoning and no requirement for the models to use outside knowledge to reach an answer. The results showed that beyond a certain level of complexity, the models demonstrated total failure.
“Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” the researchers wrote.
One of the puzzles the researchers used was the Tower of Hanoi, in which disks are stacked in order of size, largest at the bottom, on one of three rods. The player must move the ‘tower’ of disks from the first rod to the third, moving only one disk per turn and never placing a larger disk atop a smaller one.
With just three disks at the start, the models were able to solve the problem and were graded highly for each step they took toward the solution. But beyond a set number of disks, the accuracy of all models fell to zero.
Researchers noted this was the case even when the solution algorithm was provided in the prompt, implying the limitation is inherent to reasoning models rather than a matter of working out the right strategy. On less complex problems, the models were also found to ‘overthink’ – finding the solution early on, then wasting time and tokens considering other, incorrect paths to the solution.
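For context, the Tower of Hanoi has a well-known recursive solution, and the minimum number of moves grows exponentially – 2^n - 1 for n disks – which is why adding even a few disks ramps up the difficulty so quickly. The Python sketch below shows that classic algorithm; the paper doesn't reproduce its exact prompt wording here, so the function and rod names are illustrative.

```python
# Minimal sketch of the classic recursive Tower of Hanoi solution.
# Function and rod names are illustrative, not taken from the paper's prompts.

def solve_hanoi(n, source, spare, target, moves):
    """Append the moves needed to shift n disks from source to target."""
    if n == 0:
        return
    # Park the n-1 smaller disks on the spare rod...
    solve_hanoi(n - 1, source, target, spare, moves)
    # ...move the largest remaining disk straight to the target...
    moves.append((source, target))
    # ...then restack the smaller disks on top of it.
    solve_hanoi(n - 1, spare, source, target, moves)

moves = []
solve_hanoi(3, "A", "B", "C", moves)
print(moves)                   # 7 moves for three disks (2^3 - 1)
print(len(moves) == 2**3 - 1)  # True
```

The point the researchers make is that even with a procedure like this spelled out for the models, accuracy still fell to zero once the disk count passed a threshold.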
Overall, the paper suggests that LRMs are far less generalizable than providers have claimed, with a tendency to reduce computation as problems increase in difficulty rather than increasing it to power through to a solution.
The work was carried out by Apple researchers Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. Between them, the team includes experts in efficient LLM usage, AI and machine learning, as well as an intern focused on deep learning and reinforcement learning.
A huge blow for the ‘reasoning’ crowd
Apple’s research paper seriously rains on the parade of the world’s most prominent AI developers, most of whom have spent the past nine months shouting from the rooftops about the potential for reasoning models.
The exact claims about what ‘reasoning’ constitutes have always been hazy, to say the least. In its announcement post for o1-preview, OpenAI claimed it had developed a new family of models that could “spend more time thinking through problems before they respond, much like a person would” and even “refine their thinking process” based on mistakes.
Similarly, Google has hyped up how Gemini 2.5’s reasoning helps it handle more complicated problems, while Anthropic claimed that Claude 3.7 Sonnet was designed with an emphasis on “real-world tasks that better reflect how businesses actually use LLMs” rather than math and computer science problems.
This research paper throws these claims into serious doubt, particularly through its systematic demonstration that LRM accuracy doesn’t just degrade on harder tasks, it collapses entirely.
Earlier LLMs have tended to prioritize information given at the beginning and end of large prompts and brush over the context provided in the middle. This has meant they’ve historically struggled with so-called ‘needle in a haystack’ problems, in which they are tested on their ability to recall and use information buried somewhere in the middle of a prompt running to 128,000 tokens or more.
What we’re seeing here could be the reasoning model equivalent: a fatal flaw at the heart of the algorithms powering LRMs that causes them to lose ‘focus’ on a problem and inevitably fail when left to work on it for too long.
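For illustration, a ‘needle in a haystack’ test is typically assembled by burying a single fact at a chosen depth inside a long run of filler text and asking the model to retrieve it. The sketch below shows the general idea; the filler sentence, the needle, and the function name are invented for this example and aren’t drawn from any specific benchmark.

```python
# Rough sketch of how a 'needle in a haystack' prompt is assembled:
# one fact is buried deep inside a long filler context, and the model
# is asked to retrieve it. All text and names here are illustrative.

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret launch code is 7-ALPHA-9. "

def build_haystack_prompt(total_sentences=5000, needle_depth=0.5):
    """Return a long prompt with the needle buried at the given relative depth."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(total_sentences * needle_depth), NEEDLE)
    context = "".join(sentences)
    question = "\n\nBased only on the text above, what is the secret launch code?"
    return context + question

prompt = build_haystack_prompt()
print(f"{len(prompt.split()):,} words of context")  # ~45,000 words; real tests scale this much further
```

Models that skim the middle of a long context tend to miss the needle as the filler grows, which is the kind of failure the analogy above refers to.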
It’s worth noting that the researchers allowed the models to use their full ‘budget’ for thinking – for example, 64,000 tokens for Claude 3.7 Sonnet – but that beyond a certain level of complexity, the models simply stopped spending tokens on reasoning through the problem further.
No one wants AI models that give up if they decide a problem is too hard, not least if they’ve been promised these are the ones built for the hardest problems.
If the limitations are indeed inherent to current training approaches, developers face the headache of quickly devising new techniques and architectures that solve the issue.
For now, customers may start to question if the ‘thinking’ models they’ve been sold are truly up to the task.
Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.