12,000 API keys and passwords were found in a popular AI training dataset – experts say the issue is down to poor identity management
Increasingly complex identity management is breeding complacency among developers, leading to hardcoded secrets and exposed API keys
The discovery of almost 12,000 valid secrets in the archive of a popular AI training dataset is the result of the industry’s inability to keep up with the complexities of identity management, experts have told ITPro.
Researchers at Truffle Security found nearly 12,000 ‘live’ API keys and passwords when analysing the Common Crawl archive used to train open source LLMs such as DeepSeek.
The researchers trawled through the December 2024 Common Crawl archive, consisting of 400TB of web data gathered from 2.67 billion web pages, and found 11,908 live secrets using their open source secret scanner, TruffleHog.
The report found these secrets had been hardcoded into front-end HTML and JavaScript rather than stored in server-side environment variables.
In total, TruffleHog found 219 different secret types in the archive including API keys for AWS and Walkscore.
Mailchimp API keys were the most frequently leaked secret, however, with the researchers finding 1,500 unique keys hardcoded into HTML forms and JavaScript snippets.
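The pattern the researchers describe – a credential embedded directly in code that gets published and scraped, versus one kept server-side – can be sketched in a few lines of Python. This is a minimal illustration; the variable and environment names here are assumptions, not taken from the report:

```python
import os

# Anti-pattern: a secret hardcoded into source code. If this file is
# published to the web, crawlers (and eventually AI training sets)
# can pick the key up verbatim.
HARDCODED_KEY = "abc123-us1"  # illustrative placeholder, not a real key

# Safer pattern: keep the secret out of the codebase entirely and
# read it from a server-side environment variable at runtime.
def get_api_key() -> str:
    key = os.environ.get("MAILCHIMP_API_KEY")
    if key is None:
        raise RuntimeError("MAILCHIMP_API_KEY is not set")
    return key
```

Because the environment variable lives only on the server, the key never appears in any HTML or JavaScript shipped to browsers, and so never reaches a web archive.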
The report warned that this exposure of LLMs to examples of code containing hardcoded secrets could lead to them suggesting these secrets in their model outputs, although it noted fine-tuning, alignment techniques, prompt context, and alternative training data can mitigate this risk.
Nonetheless, malicious actors could use the keys for phishing campaigns, data exfiltration, and brand impersonation, researchers said.
Industry on an “unsustainable path for growing infrastructure complexity”
IT leaders have warned that an increasingly complex technology landscape, combined with an ever-expanding number of machine identities for organizations to manage, is a major factor in why these secrets were exposed.
As developers struggle to manage complex machine identities, human errors like hardcoding secrets become much more common, leading to them turning up in AI training data scraped by web crawlers as in the Common Crawl case.
Speaking to ITPro, Darren Meyer, security research advocate at Checkmarx, suggested this problem has been around for a while and is only set to get worse as organizations increase the number of machine identities they need to manage by adopting new technologies.
“This problem of leaking credentials and related secrets because of machine-to-machine authentication requirements is a long-standing and growing issue,” he said.
“New use cases like training AI models on otherwise private data will definitely increase the likelihood that secrets leak, as well as the impact of those leaks.”
Ev Kontsevoy, CEO at Teleport, added he was not surprised by these findings and that at the current rate the industry is on an “unsustainable path for growing infrastructure complexity”. Kontsevoy further warned that this will continue to happen unless the industry changes its understanding of identity.
“It’s never surprising to see secrets like API keys making their way into places they shouldn’t be. We are on an unsustainable path for growing infrastructure complexity that will continuously expose secrets and waste the productivity of engineers, unless we rethink our approach to identity and security,” he argued.
“Every emerging technology being brought into production is on one hand critical for businesses to stay competitive – because your competitors are adopting that tech as well – but on the other, it represents yet another attack vector,” Kontsevoy added.
“Every single layer of a technology listening on the network has its own idea of users, its own role-based access control, its own configuration and configuration syntax. That requires expertise, which most teams today lack to secure every little thing they have, and yet the future keeps bringing new things they need to secure.”
Organizations have their work cut out for them
Meyer said addressing this problem will not be easy and organizations have two relatively stark challenges ahead of them to avoid exposing their secrets, whether through AI training data or otherwise.
“Organizations need to do two relatively challenging things. Firstly, organizations should be seeking to avoid using long-life secrets for machine-to-machine authentication, replacing such systems with OIDC or other similar systems that use short-lived tokens wherever possible,” he told ITPro.
“This reduces the impact of a secrets leak, as the leaked secrets are much more likely to have expired by the time an attacker gets hold of them, making them useless.”
“Secondly, they should have strong processes around AI adoption to ensure that AI agents and related systems don’t have access to sensitive data in most cases. This type of control has to happen at every stage, from alerting about secrets being leaked during development to carefully monitoring the data being fed into AI models during training and operation.”
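Meyer’s first point – that short-lived credentials limit the blast radius of a leak because they expire before an attacker can replay them – can be illustrated with a minimal, hypothetical sketch. The token class and 15-minute TTL below are assumptions for illustration, not part of any specific OIDC implementation:

```python
import time
from dataclasses import dataclass

@dataclass
class ShortLivedToken:
    """A hypothetical machine-to-machine credential with a built-in expiry."""
    value: str
    expires_at: float  # Unix timestamp after which the token is useless

    def is_valid(self) -> bool:
        # A leaked token only matters while this check still passes.
        return time.time() < self.expires_at

def mint_token(ttl_seconds: float = 900.0) -> ShortLivedToken:
    # In a real OIDC flow an identity provider would issue and sign this
    # token; here we simply stamp an expiry 15 minutes out.
    return ShortLivedToken(value="tok_example",
                           expires_at=time.time() + ttl_seconds)
```

A key scraped from a months-old web archive, as in the Common Crawl case, would fail this `is_valid` check long before anyone could abuse it – unlike the long-life API keys Truffle Security found, which were still live.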
He added that AI agents that require varying levels of access to potentially sensitive areas of their IT environment will introduce further identity management challenges.
“In cases where the purpose of the AI agent requires access to secrets or other sensitive data, those adoption processes should ensure that access to the model and any implementing applications is tightly controlled.”
Solomon Klappholz is a former staff writer for ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in cybersecurity, IT regulation, industrial infrastructure applications, and machine learning.