Forum: Rise n' Shine

'Current LLMs introduce substantial errors when editing work docu

From TechnologyDaily@1337:1/100 to All on Tuesday, May 12, 2026 16:45:27

'Current LLMs introduce substantial errors when editing work documents': Microsoft scientists find most AI models struggle with long-running tasks so maybe don't trust them completely just yet

Date:
Tue, 12 May 2026 15:35:00 +0000

Description:
The more interactions an AI model has, the less reliable it becomes, experts find: even the best scored only 80.9%, and the worst just 10.0%.

FULL STORY ======================================================================
Microsoft researchers determine that current LLMs aren't good at long-running tasks
More interactions and less structure significantly reduce benchmark performance
"Python is the only domain where most models are ready"

New research from a trio of Microsoft workers has uncovered a fundamental issue that could be blocking effective agentic AI: most AI models can't reliably handle long-running workflows.

To quantify their findings, the researchers introduced a new benchmark, DELEGATE-52, to provide metrics across 52 sectors, including coding, accounting, science and more. Ultimately, the paper concluded current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

AI isn't that good at long-running tasks, yet

The study goes into some of the latest AI models including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with lesser models even more likely to get things wrong.

The DELEGATE-52 benchmark uses real documents of around 15K tokens in length and introduces 5-10 complex editing tasks with a "round-trip relay simulation" that asks the AI to perform a transformation and then reverse it. This allows the researchers to measure how faithfully each model reconstructs the documents in their original form.
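To make the round-trip idea concrete, here is a minimal sketch in Python. It is not the paper's harness: the article does not say how DELEGATE-52 scores reconstruction, so this uses the similarity ratio from Python's standard difflib module as a stand-in, and the names relay, forward and backward are hypothetical. Each step applies an edit and then its supposed inverse, and the result is compared against the untouched original.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical type: any function that rewrites a document (e.g. a model call).
Edit = Callable[[str], str]


def similarity(original: str, restored: str) -> float:
    """Stand-in metric (assumption): difflib ratio in [0, 1], 1.0 = identical."""
    return SequenceMatcher(None, original, restored, autojunk=False).ratio()


def relay(original: str, steps: List[Tuple[Edit, Edit]]) -> List[float]:
    """Chain round trips: each step transforms the document and then tries
    to reverse it. The restored text feeds the next step, so any damage
    carries forward. Returns the similarity to the original after each step."""
    scores = []
    doc = original
    for forward, backward in steps:
        doc = backward(forward(doc))
        scores.append(similarity(original, doc))
    return scores


# Toy edits standing in for model behaviour.
lossless = (str.upper, str.lower)  # reversible for lowercase text
lossy = (lambda s: s.rsplit("\n", 2)[0] + "\n", lambda s: s)  # drops last line

doc = "alpha\nbeta\ngamma\n"
scores = relay(doc, [lossless, lossy, lossless])
print(scores)
```

In this toy run the first round trip is perfect, the second silently drops a line, and the third cannot recover it even though it is itself lossless. That is the compounding pattern the article describes.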

Highly structured and programmatic areas were where the models performed best, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." Conversely, models struggled with natural language workflows, creative areas and semi-structured documents.

The paper also finds that the longer the token length, the more likely an AI model is to struggle.

Frontier models differed not in eliminating errors but in delaying them. Other models tested by Microsoft's researchers included a number of GPT-5 and GPT-4 generations, Claude options, Gemini models and one each from Mistral, xAI and Moonshot, totalling 19 different models from six families.

Gemini 3.1 Pro took first place with a DELEGATE-52 benchmark score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) round out the top three, and GPT 5 Nano (10.0%) falls into last place.
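As a rough sanity check on these figures, the short calculation below averages the three top scores and compares the result with the "25% of document content" corruption figure quoted earlier. Two things here are assumptions and not statements from the paper: that the benchmark score can be read as the share of content preserved, and that loss is spread evenly across the 20 interactions. The paper describes errors as sparse but severe, so real degradation is unlikely to be this smooth.

```python
# Scores after 20 interactions, as reported in the article (percent).
top_three = {
    "Gemini 3.1 Pro": 80.9,
    "Claude 4.6 Opus": 73.1,
    "GPT-5.4": 71.5,
}


def mean_score(scores: dict) -> float:
    """Arithmetic mean of the reported scores, in percent."""
    return sum(scores.values()) / len(scores)


def per_step_retention(final_pct: float, interactions: int) -> float:
    """Assumption: uniform geometric decay. Returns the fraction that would
    have to survive each interaction to end at final_pct after all of them."""
    return (final_pct / 100.0) ** (1.0 / interactions)


avg = mean_score(top_three)
implied_loss = 100.0 - avg          # just under 25 percent
retention = per_step_retention(avg, 20)

print(f"mean score: {avg:.1f}%  implied loss: {implied_loss:.1f}%")
print(f"per-interaction retention under uniform decay: {retention:.4f}")
```

Under those assumptions the top three average out at roughly three quarters of content preserved, which lines up with the quoted 25% figure, and even a loss of a percent or two per interaction is enough to get there over 20 steps.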

In short, the paper concludes that today's AI models are not reliable enough to be trusted for long-running, autonomous workflows, highlighting key areas on which model developers must focus in the future and offering up yet another benchmark to determine model capability.

Via The Register

======================================================================
Link to news story: https://www.techradar.com/pro/current-llms-introduce-substantial-errors-when-editing-work-documents-microsoft-scientists-find-most-ai-models-struggle-with-long-running-tasks-so-maybe-dont-trust-them-completely-just-yet

--- Mystic BBS v1.12 A49 (Linux/64)
* Origin: tqwNet Technology News (1337:1/100)
