‘LLMs are unreliable delegates’: Microsoft researchers say you probably shouldn’t trust AI with work documents
A research paper from Microsoft shows AI degrades documents over longer workflows
AI providers keep telling us to hand off work to agents, but research from Microsoft suggests that might not be a wise move.
In a pre-print paper, a trio of Microsoft researchers found that large language models (LLMs) corrupt documents over the course of long, multi-step workflows, resulting in data deletion and even hallucinations.
Top-tier foundation models – including Gemini 3.1 Pro, Claude Opus 4.6, and GPT 5.4 – corrupted an average of 25% of the content in a document during the research, with other models corrupting more than half.
"Delegation requires trust – the expectation that the LLM will faithfully execute the task without introducing errors into documents," said researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville.
"Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
The work comes amid growing challenges with extensive AI use, with one study showing workers were wasting half a day each week fixing AI-made "workslop."
Even in coding, AI is fast but creates flaws in software, according to a study by CodeRabbit. The fact that half of software developers aren't checking AI-generated code is also compounding the issue, separate research shows.
Testing AI delegation
To test how well LLMs manage documents, the trio of researchers built a tool called DELEGATE-25 to simulate long workflows featuring in-depth document editing across 52 professional domains, including coding. They used that tool across 19 LLMs, including the aforementioned frontier models.
DELEGATE-25 tracked the text changes made to documents as the LLMs manipulated them over five to ten complex editing tasks in each domain.
The team found that a quarter of document content was lost across frontier models, with that rising to half across the average of all models tested.
Degradation levels depended on the domain, with LLMs doing better in programming than in natural language or niche settings, such as earnings statements or music notation.
A model was deemed "ready" to be delegated tasks in a specific domain if it could score 98% accuracy after 20 interactions. Only one domain saw that score across most models: Python. Gemini 3.1 Pro had the best performance, scoring at least 98% in 11 of 52 domains.
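To make that readiness criterion concrete, it can be sketched in a few lines of code: a model passes in a domain only if document accuracy never drops below 98% across all 20 interactions. The similarity measure and function names below are illustrative assumptions, not the paper's actual methodology.

```python
from difflib import SequenceMatcher

READY_THRESHOLD = 0.98  # accuracy bar from the paper's readiness criterion
INTERACTIONS = 20       # number of interactions the model must survive

def accuracy(reference: str, produced: str) -> float:
    """Hypothetical accuracy measure: textual similarity between the
    intended document state and what the model actually produced."""
    return SequenceMatcher(None, reference, produced).ratio()

def is_ready(reference_docs: list[str], produced_docs: list[str]) -> bool:
    """A model is 'ready' for a domain if accuracy holds at 98% or above
    across all 20 interactions (a simplification of the paper's test)."""
    scores = [accuracy(r, p) for r, p in zip(reference_docs, produced_docs)]
    return len(scores) == INTERACTIONS and min(scores) >= READY_THRESHOLD
```

Note the `min()` check: one badly corrupted interaction is enough to fail a domain, which mirrors the researchers' finding that failures arrive as rare but severe events rather than steady drift.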
Turning to agents doesn't help either: additional experiments showed that agentic setups failed to improve performance. Degradation was worst for larger documents and longer interaction periods.
Researchers said their work showed "that models are not ready for delegated workflows in the vast majority of domains, with models severely corrupting documents (at least -20% degradation) in 80% of our simulated conditions."
The results weren't down to consistently adding small errors with every step, the researchers noted. Instead, some of the time models worked perfectly, while other times they lost massive amounts of data in one go.
"Stronger models do not avoid small errors better; they delay critical failures and experience them in fewer interactions," they said.
Unchecked errors
Because all of the models tested saw degradation in longer workflows – except when working in Python – researchers argued that longer interactions are necessary to benchmark LLM performance.
They did note that systems are improving: comparing GPT-4o with GPT-5.4 showed a leap in performance from 14.7% to 71.5%.
However, the study added that improvements to LLMs have encouraged workers to let AI systems complete tasks with less supervision, as with vibe coding.
"Crucially, users delegating work might lack the expertise or time to review changes implemented by the LLM, and must trust that the LLM does not introduce unchecked errors, such as hallucinations, or deletions," they noted.
For those working with AI, they advised caution, saying we shouldn't assume an AI that works well in one domain can manage the same in another.
"Current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains," the paper noted.
"In general, users still need to closely monitor LLM systems as they operate and complete tasks on their behalf."
Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.
Nicole is the author of a book about the history of technology, The Long History of the Future.
