12,000 API keys and passwords were found in a popular AI training dataset – experts say the issue is down to poor identity management
Increasingly complex identity management is driving complacency among developers and leading to hardcoded secrets that expose API keys
The discovery of almost 12,000 valid secrets in the archive of a popular AI training dataset is the result of the industry’s inability to keep up with the complexities of identity management, experts have told ITPro.
Researchers at Truffle Security found nearly 12,000 ‘live’ API keys and passwords when analysing the Common Crawl archive used to train open source LLMs such as DeepSeek.
The researchers trawled through the December 2024 Common Crawl archive, consisting of 400TB of web data gathered from 2.67 billion web pages, and found 11,908 live secrets using their open source secret scanner, TruffleHog.
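For context, a ‘live’ secret is one that still authenticates: the scanner verifies each candidate by calling the relevant provider’s API. As a rough illustration (not TruffleHog’s actual code), checking a Mailchimp key could look like the sketch below, which assumes Node 18+ for the global fetch API and relies on Mailchimp’s documented conventions that keys end in a datacenter suffix and that its /3.0/ping endpoint answers 200 for a valid key. The helper name is ours.

```typescript
// Minimal sketch of "live" secret verification for one secret type
// (Mailchimp), in the spirit of what a scanner like TruffleHog does.
// Assumes Node 18+ for the global fetch API; the helper name is illustrative.

async function isMailchimpKeyLive(apiKey: string): Promise<boolean> {
  // Mailchimp keys end in a datacenter suffix, e.g. "abc123...-us1",
  // which determines the API host to call.
  const dc = apiKey.split("-").pop();
  if (!dc) return false;

  const res = await fetch(`https://${dc}.api.mailchimp.com/3.0/ping`, {
    headers: {
      // Mailchimp uses HTTP basic auth; the username can be any string.
      Authorization:
        "Basic " + Buffer.from(`anystring:${apiKey}`).toString("base64"),
    },
  });
  return res.ok; // 200 = key still valid; 401 = revoked or invalid
}
```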
The report found these secrets had been hardcoded into front-end HTML and JavaScript rather than stored in server-side environment variables.
In total, TruffleHog found 219 different secret types in the archive, including API keys for AWS and Walkscore.
Mailchimp API keys were the most frequently leaked secret, however, with the researchers finding 1,500 unique keys hardcoded into HTML forms and JavaScript snippets.
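The fix for this pattern is structural rather than cosmetic: the browser should never see the key at all. As a rough sketch of the server-side approach the report contrasts with hardcoding, a small Express backend can hold the key in an environment variable and make the Mailchimp call itself; the route, LIST_ID variable, and port here are illustrative, and Node 18+ is assumed for fetch.

```typescript
// Sketch of the server-side pattern: instead of embedding
// MAILCHIMP_API_KEY in front-end JavaScript, the browser posts to our
// own backend, which reads the key from an environment variable and
// calls Mailchimp server-to-server. Route name, LIST_ID, and port are
// illustrative.
import express from "express";

const app = express();
app.use(express.json());

app.post("/api/subscribe", async (req, res) => {
  const apiKey = process.env.MAILCHIMP_API_KEY; // never shipped to the client
  if (!apiKey) {
    return res.status(500).json({ error: "server misconfigured" });
  }

  const dc = apiKey.split("-").pop(); // datacenter suffix, e.g. "us1"
  const r = await fetch(
    `https://${dc}.api.mailchimp.com/3.0/lists/${process.env.LIST_ID}/members`,
    {
      method: "POST",
      headers: {
        Authorization:
          "Basic " + Buffer.from(`anystring:${apiKey}`).toString("base64"),
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        email_address: req.body.email,
        status: "pending", // double opt-in
      }),
    }
  );
  res.status(r.status).end();
});

app.listen(3000);
```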
The report warned that exposing LLMs to code examples containing hardcoded secrets could lead to models suggesting those secrets in their outputs, although it noted that fine-tuning, alignment techniques, prompt context, and alternative training data can mitigate this risk.
Nonetheless, malicious actors could use the keys for phishing campaigns, data exfiltration, and brand impersonation, researchers said.
Industry on an “unsustainable path for growing infrastructure complexity”
IT leaders have warned that an increasingly complex technology landscape, combined with an ever-expanding number of machine identities for organizations to manage, is a major factor in why these secrets have been exposed.
As developers struggle to manage complex machine identities, human errors like hardcoding secrets become far more common, and those secrets then surface in AI training data scraped by web crawlers, as in the Common Crawl case.
Speaking to ITPro, Darren Meyer, security research advocate at Checkmarx, suggested this problem has been around for a while and is only set to get worse as organizations increase the number of machine identities they need to manage by adopting new technologies.
“This problem of leaking credentials and related secrets because of machine-to-machine authentication requirements is a long-standing and growing issue,” he said.
“New use cases like training AI models on otherwise private data will definitely increase the likelihood that secrets leak, as well as the impact of those leaks.”
Ev Kontsevoy, CEO at Teleport, added that he was not surprised by the findings and warned that such exposures will continue unless the industry rethinks its understanding of identity.
"It’s never surprising seeing that secrets like APIs keys are making their way into places they shouldn’t be. We are on an unsustainable path for growing infrastructure complexity that will continuously expose secrets and waste the productivity of engineers, unless we rethink our approach to identity and security,” he argued.
“Every emerging technology being brought into production is on one hand critical for businesses to stay competitive – because your competitors are adopting that tech as well – but on the other, it represents yet another attack vector,” Kontsevoy added.
“Every single layer of a technology listening on the network has its own idea of users, its own role-based access control, its own configuration and configuration syntax. That requires expertise, which most teams today lack, to secure every little thing they have, and yet the future keeps bringing new things they need to secure.”
Organizations have their work cut out for them
Meyer said addressing this problem will not be easy, with organizations facing two relatively stark challenges if they are to avoid exposing their secrets, whether through AI training data or otherwise.
“Organizations need to do two relatively challenging things. Firstly, organizations should be seeking to avoid using long-life secrets for machine-to-machine authentication, replacing such systems with OIDC or other similar systems that use short-lived tokens wherever possible,” he told ITPro.
“This reduces the impact of a secrets leak, as the leaked secrets are much more likely to have expired by the time an attacker gets hold of them, making them useless.”
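In concrete terms, one widely used version of this pattern is exchanging a workload’s OIDC token for short-lived cloud credentials. The sketch below uses AWS STS as one example of the approach Meyer describes, not necessarily the system he had in mind; the role ARN and session name are assumptions, and the OIDC token would be minted by an identity provider such as a CI platform rather than stored anywhere.

```typescript
// Sketch of swapping a long-life secret for OIDC-issued short-lived
// credentials, here via AWS STS AssumeRoleWithWebIdentity. The role ARN
// and session name are illustrative; the OIDC token comes from the
// identity provider at runtime and is never stored.
import {
  STSClient,
  AssumeRoleWithWebIdentityCommand,
} from "@aws-sdk/client-sts";

async function getShortLivedCredentials(oidcToken: string) {
  const sts = new STSClient({ region: "us-east-1" });
  const out = await sts.send(
    new AssumeRoleWithWebIdentityCommand({
      RoleArn: "arn:aws:iam::123456789012:role/ci-deploy", // illustrative
      RoleSessionName: "build-1234",
      WebIdentityToken: oidcToken,
      DurationSeconds: 900, // credentials expire after 15 minutes
    })
  );
  // Even if these values leaked into a web page and then a crawl, they
  // would almost certainly have expired before anyone found them.
  return out.Credentials;
}
```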
“Secondly, they should have strong processes around AI adoption to ensure that AI agents and related systems don’t have access to sensitive data in most cases. This type of control has to happen at every stage, from alerting about secrets being leaked during development to carefully monitoring the data being fed into AI models during training and operation.”
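The crudest version of that last control is a pre-ingestion filter that screens documents for high-signal secret patterns before they reach a training pipeline. The sketch below is deliberately simplistic, with an illustrative subset of patterns; a production pipeline would lean on a dedicated scanner such as TruffleHog instead.

```typescript
// Sketch of a pre-ingestion filter that drops training documents
// containing likely secrets. The patterns are a small, illustrative
// subset; a real pipeline would use a dedicated secret scanner.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/,                          // AWS access key ID
  /[0-9a-f]{32}-us[0-9]{1,2}/,                 // Mailchimp-style key
  /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/,  // PEM private key
];

function filterTrainingDocs(docs: string[]): string[] {
  // Keep only documents that match none of the secret patterns;
  // flagged documents would be logged for review, not trained on.
  return docs.filter((doc) => !SECRET_PATTERNS.some((re) => re.test(doc)));
}
```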
He added that AI agents requiring varying levels of access to potentially sensitive areas of an IT environment will introduce further identity management challenges.
“In cases where the purpose of the AI agent requires access to secrets or other sensitive data, those adoption processes should ensure that access to the model and any implementing applications is tightly controlled.”