‘A complete accuracy collapse’: Apple throws cold water on the potential of AI reasoning – and it's a huge blow for the likes of OpenAI, Google, and Anthropic
Presented with complex logic puzzles, AI reasoning models simply gave up
Apple has suggested that AI reasoning models have clear limits when it comes to solving complex problems, undermining developer arguments that they are useful for tasks that a human would traditionally solve.
Reasoning models can solve more complex problems than standard large language models (LLMs) by breaking them down into a series of smaller problems which are solved one by one.
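As a loose illustration of that idea, the toy Python sketch below breaks a small word problem into sub-steps that are solved one at a time. The function name and figures are hypothetical, and this is not how any vendor's reasoning model actually works under the hood; it simply mirrors the decompose-then-solve pattern being described.

def solve_step_by_step(price: float, quantity: int, discount_pct: float) -> float:
    """What does an order cost after a percentage discount? Solved in sub-steps."""
    subtotal = price * quantity               # sub-problem 1: total before discount
    saving = subtotal * (discount_pct / 100)  # sub-problem 2: how much the discount removes
    total = subtotal - saving                 # sub-problem 3: final amount due
    print(f"subtotal {subtotal:.2f}, saving {saving:.2f}, total {total:.2f}")
    return total

solve_step_by_step(price=4.50, quantity=12, discount_pct=10)  # 54.00, 5.40, 48.60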
A host of major providers, including OpenAI, Anthropic, and Google, have highlighted the benefits of reasoning models over the past year, touting them as a key weapon in the enterprise AI arsenal.
The findings come in a research paper, titled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, which specifically names OpenAI’s o1 and o3 models, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet, and the latest version of Google’s Gemini.
It found that while these models efficiently handled low complexity tasks, using relatively few tokens to solve the given problem, beyond a certain level of complexity they began to take much longer to respond, waste tokens, and return incorrect answers.
Apple’s research team stated that the benchmarks currently used to evaluate what it calls large reasoning models (LRMs) are flawed, as they tend to center around how well a given model can solve coding and mathematical problems.
The researchers argued the results of these benchmarks are susceptible to data contamination – in which the answers are accidentally incorporated into a model’s training phase – and don’t give researchers proper control over the variables.
In response, they devised new puzzles for the LRMs with an emphasis on logical reasoning and no requirement for the model to draw on outside knowledge to reach an answer. The results showed that beyond a certain level of complexity, the models demonstrated total failure.
“Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” the researchers wrote.
One of the puzzles researchers used was the Tower of Hanoi, in which disks are stacked in order of size atop a rod next to two others. The player must move the ‘tower’ of disks from the first rod to the third, moving only one disk per turn and never placing a larger disk atop a smaller one.
With just three disks at the start, the models were able to solve the problem and were graded highly for each step they took toward the solution. But beyond a set number of disks, the accuracy of all models fell to zero.
Researchers noted this was the case even when the solution algorithm was provided in the prompt, implying the limitation is inherent to reasoning models. Even with less complex problems, models were also found to ‘overthink’ – finding the solution early on, then wasting time and tokens considering other, incorrect paths to the solution.
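For context, the puzzle has a compact, well-known recursive solution. The sketch below is a minimal Python illustration of that algorithm, not the paper's exact prompt, which isn't reproduced here. Solving n disks takes a minimum of 2^n - 1 moves, which is why the difficulty ramps up so sharply with each disk added.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Append the moves needed to shift n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks out of the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks on top of it

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves for three disks; n disks need at least 2**n - 1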
Overall, the paper indicates that LRMs are far less generalizable than providers have suggested, with a tendency to reduce the computation they spend as problems increase in difficulty, rather than increasing it and working through to a solution.
The work was carried out by Apple researchers Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. Between them, the team includes experts in efficient LLM usage, AI and machine learning, as well as an intern focused on deep learning and reinforcement learning.
A huge blow for the ‘reasoning’ crowd
Apple’s research paper seriously rains on the parade of the world’s most prominent AI developers, most of whom have spent the past nine months shouting from the rooftops about the potential for reasoning models.
Exact claims about what constitutes ‘reasoning’ have always been hazy, to say the least. In its announcement post for o1-preview, OpenAI claimed it had developed a new family of models that could “spend more time thinking through problems before they respond, much like a person would” and even “refine their thinking process” based on mistakes.
Similarly, Google has hyped up how Gemini 2.5’s reasoning helps it handle more complicated problems, while Anthropic even claimed that Claude 3.7 Sonnet was specifically designed with “real-world tasks that better reflect how businesses actually use LLMs” in mind, rather than with math and computer science problems.
This research paper throws these claims into serious doubt, particularly in its systematic approach to showing that LRMs get less accurate, not more, when presented with harder tasks.
LLMs have historically tended to prioritize information given at the beginning and end of large prompts while glossing over the context provided in the middle. As a result, they have struggled with so-called ‘needle in a haystack’ problems, in which they are tested on their ability to recall information buried somewhere in the middle of a prompt running in excess of 128,000 tokens.
What we’re seeing here could be the reasoning model equivalent: a fatal flaw at the heart of the algorithms powering LRMs that causes them to lose ‘focus’ and inevitably fail when allowed to work on a problem for too long.
It’s worth noting that the researchers allowed the models to use their full ‘budget’ for thinking – for example, 64,000 tokens for Claude 3.7 Sonnet – but that beyond a certain level of complexity, the models simply stopped spending tokens on reasoning through the problem further.
No one wants AI models that give up when they decide a problem is too hard, least of all customers who have been promised these are the models built for the hardest problems.
If the limitations are indeed inherent to current training methods, developers face the headache of quickly developing new approaches and architectures that solve the issue.
For now, customers may start to question if the ‘thinking’ models they’ve been sold are truly up to the task.