‘Frontier models are still unable to solve the majority of tasks’: AI might not replace software engineers just yet – OpenAI researchers found leading models and coding tools still lag behind humans on basic tasks
Large language models struggle to identify root causes or provide comprehensive solutions
AI might not replace software engineers just yet as new research from OpenAI reveals ongoing weaknesses in the technology.
Having created a benchmark dubbed ‘SWE-Lancer’ to evaluate AI’s effectiveness at completing software engineering and managerial tasks, researchers concluded that the technology is lacking.
“We evaluate model performance and find that frontier models are still unable to solve the majority of tasks,” researchers said.
Researchers found that, while AI excels in certain areas, it is limited in others. For example, AI agents are skilled at localizing problems but bad at working out what the root cause is.
While they can pinpoint the location of an issue quickly and use search capabilities to access the necessary repositories faster than humans can, they have a limited understanding of how an issue spans multiple components and files.
This frequently leads to solutions that are incorrect or insufficiently comprehensive, and agents can often fail by not finding the right file or location to edit.
In a comparison between two OpenAI models, o1 and GPT-4o, and Anthropic's Claude 3.5 Sonnet model, researchers found that all three failed to fully solve one particular user interface (UI) problem.
While o1 solved the basic issue, it missed a range of others, and GPT-4o failed to solve even the initial problem. Sonnet was quick to identify the root cause of the issue and fix the bug, but its solution was not comprehensive and did not pass the researchers' end-to-end tests.
All told, researchers said that while AI coding tools have the capacity to make software engineering more productive, users need to be wary of the potential flaws in AI-generated code.
Are AI coding tools more trouble than they’re worth?
While businesses are ramping up the use of AI coding tools, there have been plenty of warning signs to make firms stop and consider whether the tools are worth it.
Research from Harness earlier this year found that many developers are becoming increasingly bogged down with manual tasks and code remediation due to the increased use of AI coding tools.
The study noted that while these tools may offer huge benefits to software engineers, experts say they are still littered with weaknesses and lack some of the capabilities of human engineers.
“While these tools can boost efficiency, in their current state they often result in a surge of errors, security vulnerabilities, and downstream manual work that burdens developers," Sheila Flavell, COO of FDM Group, told ITPro.
The risk of vulnerabilities and malicious code being introduced into organizations is also significantly higher when AI coding tools are used, according to Shobhit Gautam, security solutions architect at HackerOne.
“AI-generated code is not guaranteed to follow security guidelines and best practices as defined by the organization standards. As the code is generated from LLMs, there is a possibility that third-party components may be used in the code and go unnoticed,” Gautam told ITPro.
“Aside from the risk of copyright infringement, the code hasn’t been through the company’s validation testing and peer reviews, potentially resulting in unchecked vulnerabilities,” Gautam added.
An overreliance on AI coding tools may also be eroding the skills of human programmers, with research from education platform O’Reilly finding that interest in traditional programming languages is in decline.
Similarly, a post from tech blogger and programmer Namanyay Goel sparked debate on this topic recently when Goel claimed junior developers lack coding skills owing to a heightened use of automated AI tooling.
How can businesses use these tools effectively?
Despite these concerns, there are clear signs AI coding tools are delivering value for both software engineers and enterprises. GitHub research from last year found that AI coding tools have helped engineers deliver more secure software and better quality code, as well as supporting the adoption of new languages.
With this in mind, firms need to prioritize certain processes to deliver success with AI tools. Flavell said businesses need to put upskilling front and center, as well as improving code reviews and quality assurance.
“It is essential that organizations create and implement governance processes to manage the use of AI generated code,” Gautam added.
“When it comes to coding, AI tools and human input will all play their part. Organizations gain the best of both worlds when they integrate these two together. Human Intelligence is essential to tailor coding to specific requirements, and AI can help experts increase their efficiency.”

George Fitzmaurice is a former Staff Writer at ITPro and ChannelPro, with a particular interest in AI regulation, data legislation, and market development. After graduating from the University of Oxford with a degree in English Language and Literature, he undertook an internship at the New Statesman before starting at ITPro. Outside of the office, George is both an aspiring musician and an avid reader.