Researchers tested over 100 leading AI models on coding tasks — nearly half produced glaring security flaws
AI models large and small were found to introduce cross-site scripting errors and seriously struggle with secure Java generation
Just 55% of code generated with AI is free of known cybersecurity vulnerabilities, according to new research from Veracode.
To test the capability of AI models to generate safe code, Veracode took existing functions and replaced part of the code with a comment describing what the finished code should look like.
In 45% of results, generated code contained known security flaws, with no significant difference in outcome between small models and the largest available.
The findings underline a major potential risk attached to ‘vibe coding’, in which software developers rely heavily on large language model (LLM) output to quickly generate code for use in software.
Researchers put over 100 LLMs across a variety of vendors, sizes, and intended applications – including models specifically intended for coding as well as general purpose models – through 80 distinct coding tasks.
Veracode said researchers intentionally used sections that could be coded in a number of different ‘correct’ ways, as well as in at least one way that would include a known software vulnerability or ‘Common Weakness Enumeration’ (CWE).
These CWEs included flaws that hackers could exploit through SQL injection, cross-site scripting (XSS), attacks on weak cryptographic algorithms, and log injection. Each featured vulnerability appears in the Open Worldwide Application Security Project (OWASP) Top 10 list.
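To make the idea of one task having both ‘correct’ and vulnerable solutions concrete, here is a minimal Python sketch of SQL injection (CWE-89). It is not taken from Veracode's test set: the table, the data, and the function names `make_db`, `find_user_unsafe`, and `find_user_safe` are all hypothetical. Both lookup functions do the same job, but only one keeps user input out of the SQL text.

```python
import sqlite3

def make_db():
    # In-memory database with two illustrative rows (hypothetical data).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn

def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated straight into the SQL text.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, name):
    # Safer: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = make_db()
payload = "nobody' OR '1'='1"  # classic injection string
unsafe_rows = find_user_unsafe(conn, payload)  # matches every row
safe_rows = find_user_safe(conn, payload)      # matches nothing
```

With the injection string as input, the concatenated query returns every user in the table, while the parameterized version returns no rows because no user has that literal name.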
Models showed inconsistent performance across different vulnerability types, achieving security pass rates of 85.6% and 80.4% respectively when it came to avoiding the cryptographic algorithm and SQL injection vulnerabilities.
In contrast, models fared extremely poorly at avoiding the XSS and log injection vulnerabilities, achieving average pass rates of just 13.5% and 12% respectively.
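For readers unfamiliar with XSS, the following Python sketch shows the kind of difference at stake. It is an illustration only, not one of Veracode's tasks, and the function names `render_comment_unsafe` and `render_comment_safe` are hypothetical. The safe version relies on `html.escape` from the Python standard library to neutralize markup before it is placed in a page.

```python
import html

def render_comment_unsafe(comment):
    # Vulnerable: user text is inserted into HTML without escaping,
    # so any markup it contains is interpreted by the browser.
    return "<p>" + comment + "</p>"

def render_comment_safe(comment):
    # Safer: special characters such as < and > become HTML entities,
    # so the browser displays them as text instead of running them.
    return "<p>" + html.escape(comment) + "</p>"

payload = "<script>alert(1)</script>"
unsafe_html = render_comment_unsafe(payload)
safe_html = render_comment_safe(payload)
```

The unsafe output still contains a live `<script>` tag, whereas the escaped output carries the same characters as inert text. Real applications would typically lean on a templating engine with automatic escaping rather than hand-built strings.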
Researchers noted that the tested LLMs are continuing to improve at avoiding the SQL injection and cryptographic algorithm flaws over time, while seemingly getting worse at avoiding the XSS and log injection vulnerabilities.
Overall, Veracode noted that the security improvements of the tested LLMs have flatlined.
The authors of the report noted that it is possible to phrase AI code prompts in a more security-conscious way, but that this is far from standard practice. With this in mind, they intentionally used short prompts, to examine how models react when given minimal context.
But they also warned that even if firms take a more security-aware approach to code generation, LLMs are still prone to errors such as misjudging which variables require sanitization, a necessary step for preventing code injection attacks.
“Even with a large context window, it is unclear whether models can perform the detailed interprocedural dataflow analysis required to determine this information precisely,” they wrote.
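As a rough illustration of what sanitizing a single variable involves, the Python sketch below neutralizes a user-supplied value before it is written to a log, the defense against log injection (CWE-117). The helper names `neutralize_for_log` and `login_failed_message` are hypothetical, and the example assumes the username is known to be user-controlled. Establishing that for every variable across a large codebase is the dataflow problem the researchers describe.

```python
def neutralize_for_log(value):
    # Replace carriage returns and line feeds with visible escapes,
    # so a single event can never span more than one log line.
    return value.replace("\r", "\\r").replace("\n", "\\n")

def login_failed_message(username):
    # The username is treated as untrusted and neutralized before use.
    return "login failed for user=" + neutralize_for_log(username)

# An attacker tries to forge a second, fake log entry with a newline.
payload = "guest\nlogin succeeded for user=admin"
message = login_failed_message(payload)
```

Without the neutralization step, the payload would appear in the log as two separate lines, the second of which falsely reports a successful admin login.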
LLMs were tested across a range of programming languages: Python, C#, JavaScript, and Java. Overall, the researchers found LLMs were worst at generating secure Java, achieving an average score of just 28.5% in this widely used language.
AI-generated code remains a concern, but adoption is still rising
AI tools are now widely used for generating code, with 84% of software developers using AI to produce code more quickly according to recent Stack Overflow findings.
But the same report underlined continued distrust among developers in the quality of AI code, with three-quarters (75.3%) reporting that they do not trust AI outputs and 61.7% stating they have security concerns over the use of AI code.
Despite these worries, big tech continues to embrace AI code. Alphabet CEO Sundar Pichai revealed last year that 25% of Google’s internal code is now AI-generated, and Microsoft CEO Satya Nadella recently revealed that as much as 20-30% of his firm’s code was written by AI.
Nadella noted that while Microsoft has been quick to adopt AI-generated Python code, progress with C++ has been slower. Kevin Scott, CTO at Microsoft, has been bullish on overcoming these hurdles, predicting that 95% of code will be AI-generated by 2030, as reported by Business Insider.
Security teams and developers will have to carefully weigh up findings such as Veracode’s against the potential benefits to their bottom line of using AI to alter and add to their codebase.

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.