‘Frontier models are still unable to solve the majority of tasks’: AI might not replace software engineers just yet – OpenAI researchers found leading models and coding tools still lag behind humans on basic tasks
Large language models struggle to identify root causes or provide comprehensive solutions
AI might not replace software engineers just yet as new research from OpenAI reveals ongoing weaknesses in the technology.
Having created a benchmark dubbed ‘SWE-Lancer’ to evaluate AI’s effectiveness at completing software engineering and managerial tasks, researchers concluded that the technology is lacking.
“We evaluate model performance and find that frontier models are still unable to solve the majority of tasks,” researchers said.
Researchers found that, while AI excels in certain areas, it is limited in others. For example, AI agents are skilled at localizing problems but bad at working out what the root cause is.
While they can pinpoint the location of an issue with speed and use search capabilities to access necessary repositories faster than humans can, their understanding is limited in terms of how an issue spans across different components and files.
This frequently leads to solutions that are incorrect or insufficiently comprehensive, and agents can often fail by not finding the right file or location to edit.
In a comparison of two OpenAI models, o1 and GPT-4o, and Anthropic’s Claude 3.5 Sonnet model, researchers found that all three failed to entirely solve one particular user interface (UI) problem.
While o1 solved the basic issue, it missed a range of others, and GPT-4o failed to solve even the initial problem. Sonnet was quick to identify the root cause of the issue and fix the bug, but the solution was not comprehensive and did not pass the researchers’ end-to-end tests.
All told, researchers said that while AI coding tools have the capacity to make software engineering more productive, users need to be wary of the potential flaws in AI-generated code.
Are AI coding tools more trouble than they’re worth?
While businesses are ramping up the use of AI coding tools, there have been plenty of warning signs to make firms stop and consider whether the tools are worth it.
Research from Harness earlier this year found that many developers are becoming increasingly bogged down with manual tasks and code remediation due to the increased use of AI coding tools.
The study noted that while these tools may offer huge benefits to software engineers, experts say they are still littered with weaknesses and lack some of the capabilities of human engineers.
“While these tools can boost efficiency, in their current state they often result in a surge of errors, security vulnerabilities, and downstream manual work that burdens developers,” Sheila Flavell, COO of FDM Group, told ITPro.
The risk of vulnerabilities and malicious code being introduced into organizations is also significantly higher when AI coding tools are used, according to Shobhit Gautam, security solutions architect at HackerOne.
“AI-generated code is not guaranteed to follow security guidelines and best practices as defined by the organization standards. As the code is generated from LLMs, there is a possibility that third-party components may be used in the code and go unnoticed,” Gautam told ITPro.
“Aside from the risk of copyright infringement, the code hasn’t been through the company’s validation testing and peer reviews, potentially resulting in unchecked vulnerabilities,” Gautam added.
An overreliance on AI coding tools may also be eroding the skills of human programmers, with research from education platform O’Reilly finding that interest in traditional programming languages is in decline.
Similarly, a post from tech blogger and programmer Namanyay Goel sparked debate on this topic recently when Goel claimed junior developers lack coding skills owing to a heightened use of automated AI tooling.
How can businesses use these tools effectively?
Despite concerns, there are clear signs AI coding tools are delivering value for both software engineers and enterprises. GitHub research from last year revealed AI coding tools have helped engineers deliver more secure software, write better quality code, and adopt new languages.
With this in mind, firms need to prioritize certain processes to deliver success with AI tools. Flavell said businesses need to put upskilling front and center, as well as improving code reviews and quality assurance.
“It is essential that organizations create and implement governance processes to manage the use of AI generated code,” Gautam added.
“When it comes to coding, AI tools and human input will all play their part. Organizations gain the best of both worlds when they integrate these two together. Human intelligence is essential to tailor coding to specific requirements, and AI can help experts increase their efficiency.”

George Fitzmaurice is a former Staff Writer at ITPro and ChannelPro, with a particular interest in AI regulation, data legislation, and market development. After graduating from the University of Oxford with a degree in English Language and Literature, he undertook an internship at the New Statesman before starting at ITPro. Outside of the office, George is both an aspiring musician and an avid reader.

