‘LLMs are unreliable delegates’: Microsoft researchers say you probably shouldn’t trust AI with work documents

A research paper from Microsoft shows AI degrades documents over longer workflows


AI providers keep telling us to hand off work to agents, but research from Microsoft suggests that might not be a wise move.

In a pre-print paper, a trio of Microsoft researchers found that large language models (LLMs) corrupt documents over the course of long, extensive workflows, resulting in data deletion and even hallucinations.

Top-tier foundation models – including Gemini 3.1 Pro, Claude Opus 4.6, and GPT-5.4 – corrupted an average of 25% of a document's content during the research, while other models corrupted more than half.

"Delegation requires trust – the expectation that the LLM will faithfully execute the task without introducing errors into documents," said researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville.


"Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

The work comes amid growing challenges with extensive AI use, with one study showing workers were wasting half a day each week fixing AI-made "workslop."

Even in coding, AI is fast but creates flaws in software, according to a study by CodeRabbit. The fact that half of software developers aren't checking AI-generated code is also compounding the issue, separate research shows.

Testing AI delegation

To test how well LLMs manage documents, the trio of researchers built a tool called DELEGATE-25 to simulate long workflows featuring in-depth document editing across 52 professional domains, including coding. They used that tool across 19 LLMs, including the aforementioned frontier models.

DELEGATE-25 tracked the text changes made to documents as the LLMs manipulated them over five to ten complex editing tasks in each domain.

The team found that a quarter of document content was lost on average across frontier models, rising to half across all models tested.

Degradation levels depended on the domain, with LLMs doing better in programming than in natural language or niche settings, such as earnings statements or music notation.

A model was deemed "ready" to be delegated tasks in a specific domain if it could score 98% accuracy after 20 interactions. Only one domain saw that score across most models: Python. Gemini 3.1 Pro had the best performance, scoring at least 98% in 11 of 52 domains.

Turning to agents doesn't help either: additional experiments showed that agentic setups failed to improve performance. Degradation was worst for larger documents over longer interaction periods.

Researchers said their work showed "that models are not ready for delegated workflows in the vast majority of domains, with models severely corrupting documents (at least -20% degradation) in 80% of our simulated conditions."

The results weren't down to models consistently adding small errors at every step, the researchers noted. Instead, models sometimes worked perfectly, while at other times they lost massive amounts of data in one go.

"Stronger models do not avoid small errors better; they delay critical failures and experience them in fewer Interactions," they said.

Unchecked errors

Because all of the models tested saw degradation in longer workflows – except when working in Python – the researchers argued that longer interactions are necessary to benchmark LLM performance.

They did note improvement between generations: comparing GPT-4o with GPT-5.4 showed performance leaping from 14.7% to 71.5%.

However, the study added that improvements to LLMs have encouraged workers to delegate tasks rather than supervise AI systems as they complete them, as with vibe coding.

"Crucially, users delegating work might lack the expertise or time to review changes implemented by the LLM, and must trust that the LLM does not introduce unchecked errors, such as hallucinations, or deletions," they noted.

For those working with AI, they advised caution, saying we shouldn't assume an AI that works well in one domain can manage the same in another.

"Current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains," the paper noted.

"In general, users still need to closely monitor LLM systems as they operate and complete tasks on their behalf."


Freelance journalist Nicole Kobie started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.

Nicole is the author of a book about the history of technology, The Long History of the Future.