12,000 API keys and passwords were found in a popular AI training dataset – experts say the issue is down to poor identity management
Increasingly complex identity management is driving complacency among developers, leading them to hardcode secrets such as API keys
The discovery of almost 12,000 valid secrets in the archive of a popular AI training dataset is the result of the industry’s inability to keep up with the complexities of identity management, experts have told ITPro.
Researchers at Truffle Security found nearly 12,000 ‘live’ API keys and passwords when analysing the Common Crawl archive used to train open source LLMs such as DeepSeek.
The researchers trawled through the December 2024 Common Crawl archive, consisting of 400TB of web data gathered from 2.67 billion web pages, and found 11,908 live secrets using their open source secret scanner, TruffleHog.
The report found these secrets had been hardcoded in front-end HTML and JavaScript rather than stored in server-side environment variables.
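As a rough illustration of the server-side alternative the report refers to, the sketch below reads a key from the process environment at startup instead of embedding it in page source. The variable name `EXAMPLE_API_KEY` and the helper `load_api_key` are hypothetical and are not taken from the report.

```python
import os


def load_api_key(name: str = "EXAMPLE_API_KEY") -> str:
    """Read a secret from the server's environment (hypothetical variable name).

    The value lives only on the server, so it never appears in the HTML or
    JavaScript that is delivered to browsers and indexed by web crawlers.
    """
    key = os.environ.get(name)
    if not key:
        # Fail loudly rather than falling back to a hardcoded default.
        raise RuntimeError(f"{name} is not set")
    return key
```

A server would call `load_api_key()` once at startup and make the third-party API call itself, returning only the result to the browser.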
In total, TruffleHog found 219 different secret types in the archive including API keys for AWS and Walkscore.
Mailchimp API keys were the most frequently leaked secret, with the researchers finding 1,500 unique keys hardcoded into HTML forms and JavaScript snippets.
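To give a sense of how pattern-based secret scanning of page source can work, here is a minimal sketch. It is not TruffleHog's implementation, and it only flags candidate strings; it does not check whether a key is 'live'. The two regular expressions reflect commonly cited key formats and should be treated as assumptions, and the function name `find_candidate_secrets` is hypothetical.

```python
import re

# Assumed formats: AWS access key IDs as "AKIA" plus 16 uppercase
# alphanumerics; Mailchimp keys as 32 hex characters plus a "-usN" suffix.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(page_source: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for strings that look like secrets."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(page_source):
            hits.append((label, match.group(0)))
    return hits
```

A match is only a candidate: a real scanner would still need to confirm that the key works before counting it as a live secret.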
The report warned that this exposure of LLMs to examples of code containing hardcoded secrets could lead to them suggesting these secrets in their model outputs, although it noted fine-tuning, alignment techniques, prompt context, and alternative training data can mitigate this risk.
Nonetheless, malicious actors could use the keys for phishing campaigns, data exfiltration, and brand impersonation, researchers said.
Industry on an “unsustainable path for growing infrastructure complexity”
IT leaders have warned that an increasingly complex technology landscape, combined with an ever-expanding number of machine identities for organizations to manage, is a major factor in why these secrets have been exposed.
As developers struggle to manage complex machine identities, human errors like hardcoding secrets become much more common, and those secrets then turn up in AI training data scraped by web crawlers, as in the Common Crawl case.
Speaking to ITPro, Darren Meyer, security research advocate at Checkmarx, suggested this problem has been around for a while and is only set to get worse as organizations increase the number of machine identities they need to manage by adopting new technologies.
“This problem of leaking credentials and related secrets because of machine-to-machine authentication requirements is a long-standing and growing issue,” he said.
“New use cases like training AI models on otherwise private data will definitely increase the likelihood that secrets leak, as well as the impact of those leaks.”
Ev Kontsevoy, CEO at Teleport, added that he was not surprised by the findings and that, at the current rate, the industry is on an “unsustainable path for growing infrastructure complexity”. Kontsevoy further warned that this will continue to happen unless the industry changes its understanding of identity.
“It’s never surprising seeing that secrets like API keys are making their way into places they shouldn’t be. We are on an unsustainable path for growing infrastructure complexity that will continuously expose secrets and waste the productivity of engineers, unless we rethink our approach to identity and security,” he argued.
“Every emerging technology being brought into production is on one hand critical for businesses to stay competitive – because your competitors are adopting that tech as well – but on the other, it represents yet another attack vector,” Kontsevoy added.
“Every single layer of a technology listening on the network has its own idea of users, its own role-based access control, its own configuration and configuration syntax. That requires expertise, which most teams today lack to secure every little thing they have, and yet the future keeps bringing new things they need to secure.”
Organizations have their work cut out for them
Meyer said addressing this problem will not be easy and organizations have two relatively stark challenges ahead of them to avoid exposing their secrets, whether through AI training data or otherwise.
“Organizations need to do two relatively challenging things. Firstly, organizations should be seeking to avoid using long-life secrets for machine-to-machine authentication, replacing such systems with OIDC or other similar systems that use short-lived tokens wherever possible,” he told ITPro.
“This reduces the impact of a secrets leak, as the leaked secrets are much more likely to have expired by the time an attacker gets hold of them, making them useless.”
“Secondly, they should have strong processes around AI adoption to ensure that AI agents and related systems don’t have access to sensitive data in most cases. This type of control has to happen at every stage, from alerting about secrets being leaked during development to carefully monitoring the data being fed into AI models during training and operation.”
He added that AI agents requiring varying levels of access to potentially sensitive areas of an organization's IT environment will introduce further identity management challenges.
“In cases where the purpose of the AI agent requires access to secrets or other sensitive data, those adoption processes should ensure that access to the model and any implementing applications is tightly controlled.”
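Meyer's first recommendation, replacing long-life secrets with short-lived tokens, can be illustrated with a small sketch. This is not an OIDC implementation; it only shows the expiry idea he describes, in which a token leaked after its lifetime has passed is no longer usable. The names `ShortLivedToken`, `issue_token` and `is_usable`, and the five-minute default lifetime, are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShortLivedToken:
    value: str
    expires_at: float  # Unix timestamp after which the token is rejected


def issue_token(ttl_seconds: int = 300, now: Optional[float] = None) -> ShortLivedToken:
    """Issue a random token that is valid for ttl_seconds (assumed default: 5 minutes)."""
    issued_at = time.time() if now is None else now
    return ShortLivedToken(secrets.token_urlsafe(32), issued_at + ttl_seconds)


def is_usable(token: ShortLivedToken, now: Optional[float] = None) -> bool:
    """A token is usable only strictly before its expiry time."""
    current = time.time() if now is None else now
    return current < token.expires_at
```

Under these assumptions, a token scraped from a web page an hour after it was issued would fail the `is_usable` check, whereas a hardcoded long-life key would still work.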
Solomon Klappholz is a former staff writer for ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in cybersecurity, IT regulation, industrial infrastructure applications, and machine learning.