The world's 'first AI software engineer' isn't living up to expectations: Cognition AI's 'Devin' assistant was touted as a game changer for developers, but so far it's fumbling tasks and struggling to compete with human workers
Devin failed to complete most tasks given to it by researchers
Devin, a coding assistant hailed as the world's 'first AI software engineer', was given 20 coding tasks – it managed to complete just three, taking longer than expected and going down strange routes to achieve its goals.
The AI coding tool, developed by Cognition AI, was hailed as a transformative solution to help streamline software development when it was unveiled last year.
Costing around $500 per month, the AI assistant works via Slack so it feels like chatting to a colleague. At the time, Cognition showed a demo of Devin picking up jobs on Upwork, a freelancing platform that is used by software engineers to find work.
However, third-party researchers haven't been able to replicate those results, according to reports: one software developer picked apart the Upwork claims, and AI researchers who assessed Devin found it lacking.
Devin was framed as a game changer AI tool
At Devin's launch last year, Cognition claimed that the tool could "make money taking on messy Upwork tasks," sharing a video purporting to show just that.
But software developer Carl Brown posted his own video in response, arguing that the company was not telling the truth about the tool's abilities, revealing what "Devin was supposed to do, what it actually managed to do instead, and how bad of a job that it did."
Brown noted that it took him 36 minutes to do the task himself, while Devin spent six hours failing to do it.
Cognition's claims about Devin were also tested by a team of researchers at Answer.AI, whose results were closer to Brown's than to what the original blog post claimed: Devin completed only three of 20 tasks.
There were some "early wins", however. Devin could pull a Notion database into Google Sheets with "surprising competence", they noted, completing the task in an hour with only a few minutes of human interaction.
The code worked, but was "a bit verbose." Another task, building a planet tracker, was similarly successful.
"This felt like a glimpse into the future - an AI that could handle the 'glue code' tasks that consume so much developer time," the researchers said.
More complicated tasks started to raise challenges, or as the researchers said: "as we scaled up our testing, cracks appeared."
"Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions," they noted. "Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible."
Over a month, they tasked Devin with creating new projects from scratch, performing research and analyzing or modifying existing projects, but out of 20 such tasks, just three were successful.
"The most frustrating aspect wasn’t the failures themselves - all tools have limitations - but rather how much time we spent trying to salvage these attempts," they said.
How to use Devin
That's a far cry from what was advertised when the AI assistant was first unveiled in March of last year. A blog post on Cognition's website claimed Devin could take on basic tasks for software engineers, allowing them to focus on bigger problems.
The website says Devin can find and fix bugs, build and deploy an entire app end-to-end, and even train and fine-tune an AI model.
"With our advances in long-term reasoning and planning, Devin can plan and execute complex engineering tasks requiring thousands of decisions," the company said. "Devin can recall relevant context at every step, learn over time, and fix mistakes."
Cognition hasn't yet replied to a request for comment from ITPro, but its own blog post does give some context to how the system could be used more successfully than these tests suggest.
The company says Devin "can be an all-purpose tool", but recommends starting with smaller tasks such as simple bugs. Notably, the company said that it works best when you "give Devin tasks that you know how to do yourself" and tell the tool how to test or check its own work.
Thereafter, Devin can prove beneficial in helping to break down large tasks into smaller ones that will take less than three hours.
Given Answer.AI's success using Devin for smaller "glue code" tasks, perhaps such advice about starting small should be heeded.
Indeed, this research challenging the usefulness of the current crop of AI software assistants comes as Meta founder Mark Zuckerberg has predicted that AI will be doing the work of mid-level engineers this year — but with some serious caveats.
"In the beginning it’ll be really expensive to run, then you can get it to be more efficient and then over time we’ll get to the point where a lot of the code in our apps and including the AI that we generate is actually going to be built by AI engineers instead of people engineers," he said.
Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.
Nicole is the author of a book about the history of technology, The Long History of the Future.