The world's 'first AI software engineer' isn't living up to expectations: Cognition AI's 'Devin' assistant was touted as a game changer for developers, but so far it's fumbling tasks and struggling to compete with human workers
Devin failed to complete most tasks given to it by researchers


Devin, a coding assistant hailed as the world’s 'first AI software engineer', was given 20 coding tasks – it managed to complete just three, taking longer than expected and going down strange routes to achieve its goals.
The AI coding tool, developed by Cognition AI, was hailed as a transformative solution to help streamline software development when it was unveiled last year.
Costing around $500 per month, the AI assistant works via Slack so it feels like chatting to a colleague. At the time, Cognition showed a demo of Devin picking up jobs on Upwork, a freelancing platform that is used by software engineers to find work.
However, third-party researchers have not been able to replicate those results, according to reports. One software developer picked apart the Upwork claims, and AI researchers who assessed Devin found it lacking.
Devin was framed as a game changer AI tool
At Devin's launch last year, Cognition claimed that the tool could "make money taking on messy Upwork tasks," sharing a video purporting to show just that.
But software developer Carl Brown posted his own video in response, arguing that the company was not telling the truth about the tool's abilities, revealing what "Devin was supposed to do, what it actually managed to do instead, and how bad of a job that it did."
Brown noted that it took him 36 minutes to do the task himself, while Devin spent six hours failing to do it.
Cognition's claims about Devin were also tested by a team of researchers at Answer.AI, and their results were closer to Brown's than to what the original blog post claimed: Devin completed only three of 20 tasks.
There were some "early wins", however. Devin could pull a Notion database into Google Sheets with "surprising competence", they noted, completing the task in an hour with only a few minutes of human interaction.
The code worked, but was "a bit verbose." Another task, building a planet tracker, was similarly successful.
"This felt like a glimpse into the future - an AI that could handle the 'glue code' tasks that consume so much developer time," the researchers said.
More complicated tasks started to raise challenges, or as the researchers said: "as we scaled up our testing, cracks appeared."
"Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions," they noted. "Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible."
Over a month, they tasked Devin with creating new projects from scratch, performing research and analyzing or modifying existing projects, but out of 20 such tasks, just three were successful.
"The most frustrating aspect wasn’t the failures themselves - all tools have limitations - but rather how much time we spent trying to salvage these attempts," they said.
How to use Devin
That's a far cry from what was advertised when the AI assistant was first unveiled in March of last year. A blog post on Cognition's website claimed Devin could take on basic tasks for software engineers, allowing them to focus on bigger problems.
The website says Devin can find and fix bugs, build and deploy an entire app end-to-end, and even train and fine-tune an AI model.
"With our advances in long-term reasoning and planning, Devin can plan and execute complex engineering tasks requiring thousands of decisions," the company said. "Devin can recall relevant context at every step, learn over time, and fix mistakes."
Cognition hasn't yet replied to a request for comment from ITPro, but its own blog post does give some context to how the system could be used more successfully than these tests suggest.
The company says Devin "can be an all-purpose tool", but recommends starting with smaller tasks such as simple bugs. Notably, the company said that it works best when you "give Devin tasks that you know how to do yourself" and tell the tool how to test or check its own work.
Thereafter, Devin can prove beneficial in helping to break down large tasks into smaller ones that will take less than three hours.
Given Answer.AI's success using Devin for smaller "glue code" tasks, perhaps such advice about starting small should be heeded.
Indeed, this research challenging the usefulness of the current crop of AI software assistants comes as Meta founder Mark Zuckerberg has predicted that AI will be doing the work of mid-level engineers this year — but with some serious caveats.
"In the beginning it’ll be really expensive to run, then you can get it to be more efficient and then over time we’ll get to the point where a lot of the code in our apps and including the AI that we generate is actually going to be built by AI engineers instead of people engineers," he said.
Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.
Nicole is the author of a book about the history of technology, The Long History of the Future.