Even the most advanced AI models fail more often than you think on structured outputs, raising doubts about the effectiveness of coding assistants
Date:
Sun, 22 Mar 2026 17:05:00 +0000
Description:
Large language models struggle with structured outputs, achieving only 75% accuracy on complex tasks, leaving reliability concerns for developers.
Report finds AI coding assistants regularly fail one in four structured-output tasks
Even advanced proprietary models only reach approximately 75% accuracy
Open source AI models perform worse, averaging closer to 65% reliability
The promise of artificial intelligence as a tireless coding assistant has encountered a significant roadblock after new research claimed such tools can experience a range of issues.
A recent study from the University of Waterloo found AI struggles with software development, with even the most advanced models failing on one in four structured-output tasks.
The research evaluated 11 large language models across 18 different structured formats and 44 tasks to test how well the systems could follow predefined rules, finding a clear disparity between performance on text-based tasks and outputs involving multimedia or complex structures.
Benchmarking reveals a troubling reliability gap
While text-related tasks were generally handled with moderate success, tasks requiring image, video, or website generation proved far more problematic.
Accuracy in these areas dropped sharply, raising questions about how these AI tools can be integrated safely into professional workflows.
"With this kind of study, we want to measure not only the syntax of the code, that is, whether it's following the set rules, but also whether the outputs produced for various tasks were accurate," said Dongfu Jiang, a PhD student and co-first author of the study.
Structured outputs, designed to impose format consistency through JSON, XML, or Markdown, were intended to make AI responses more reliable for developers.
AI companies, including OpenAI, Google, and Anthropic, introduced structured outputs to force responses into predictable formats.
The Waterloo research suggests this approach has not yet delivered the level
of dependability developers require.
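As a rough illustration of the difference between following a format and being correct, here is a minimal Python sketch of a format check on a model's JSON reply. It is not taken from the Waterloo study; the field names in EXPECTED_FIELDS and the function check_structured_output are hypothetical, chosen only for this example.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_FIELDS = {"name": str, "version": str, "dependencies": list}


def check_structured_output(raw: str, expected: dict) -> list[str]:
    """Return a list of format problems in a model's JSON reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # The reply is not even syntactically valid JSON.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for field, expected_type in expected.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


if __name__ == "__main__":
    reply = '{"name": "demo", "version": 1}'
    for problem in check_structured_output(reply, EXPECTED_FIELDS):
        print(problem)
```

A reply that passes a check like this has only been shown to follow the format. Whether its contents are accurate is a separate question, and that second question is the one a human reviewer still has to answer.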
Waterloo's benchmarking revealed even the most advanced proprietary models reached only about 75% accuracy, while open source alternatives performed closer to 65%.
These results suggest that, despite improvements, AI systems still make significant errors that cannot be ignored in professional development environments.
The report emphasized the need for human oversight, noting, "Developers might have these agents working for them, but they still need significant human supervision."
Although structured outputs are a step forward from free-form natural
language responses, errors remain common.
The technology is not yet robust enough to operate independently in complex development scenarios.
One might reasonably question whether the industry's enthusiasm for AI and vibe coding assistants has outpaced the actual capabilities of the underlying technology.
Even the most advanced models demonstrate a significant failure rate on structured tasks, revealing a wide gap between marketing claims and actual performance.
Therefore, for now, developers should treat these tools as experimental aids rather than autonomous colleagues.
Link to news story:
https://www.techradar.com/pro/even-the-most-advanced-ai-models-fail-more-often-than-you-think-on-structured-outputs-raising-doubts-about-the-effectiveness-of-coding-assistants