Date: 2025-03-06

The discovery of almost 12,000 valid secrets in the archive of a popular AI training dataset is the result of the industry’s inability to keep up with the complexities of identity management, experts have told ITPro.

Researchers at Truffle Security found nearly 12,000 ‘live’ API keys and passwords when analysing the Common Crawl archive used to train open source LLMs such as DeepSeek.

The researchers trawled through the December 2024 Common Crawl archive, consisting of 400TB of web data gathered from 2.67 billion web pages, and found 11,908 live secrets using their open source secret scanner, TruffleHog.

The report found these secrets had been hardcoded in the front-end HTML and JavaScript, rather than using server-side environment variables.
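As a rough illustration of the server-side alternative the report points to, the sketch below reads a key from an environment variable at runtime instead of embedding it in markup or scripts sent to the browser. The helper `load_api_key` and the variable name `EXAMPLE_API_KEY` are hypothetical and are not taken from the report.

```python
import os


def load_api_key(name: str, env=os.environ) -> str:
    """Return the secret stored under `name`, failing loudly if it is unset."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


if __name__ == "__main__":
    # EXAMPLE_API_KEY is a made-up variable name; the mapping below stands in
    # for the real process environment so the demo runs anywhere.
    demo_env = {"EXAMPLE_API_KEY": "not-a-real-key"}
    key = load_api_key("EXAMPLE_API_KEY", env=demo_env)
    print(f"loaded a key of length {len(key)}")
```

Because the key is only resolved on the server, it never appears in the HTML or JavaScript that a crawler can fetch.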

In total, TruffleHog found 219 different secret types in the archive including API keys for AWS and Walkscore.

Mailchimp API keys were the most frequently leaked secret, however, with the researchers finding 1,500 unique keys hardcoded into HTML forms and JavaScript snippets.
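To give a sense of how hardcoded keys can be spotted in page source, here is a minimal pattern-matching sketch. It is not how TruffleHog works internally; the two regular expressions are approximations of commonly seen key shapes (32 hex characters plus a `-usN` suffix for Mailchimp, `AKIA` plus 16 characters for AWS access key IDs) and are assumptions made for illustration. A pattern match alone also says nothing about whether a key is still ‘live’.

```python
import re

# Approximate key shapes, assumed for illustration only.
PATTERNS = {
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_candidate_secrets(text: str) -> list:
    """Return (pattern_name, matched_string) pairs found in `text`."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


if __name__ == "__main__":
    # A fabricated snippet with a fake key, standing in for scraped front-end code.
    sample = '<script>var key = "0123456789abcdef0123456789abcdef-us12";</script>'
    for name, value in find_candidate_secrets(sample):
        print(name, value)
```

A check like this could run over HTML and JavaScript before deployment, flagging candidates for a human to review.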

The report warned that exposing LLMs to code containing hardcoded secrets could lead the models to suggest these secrets in their outputs, although it noted that fine-tuning, alignment techniques, prompt context, and alternative training data can mitigate this risk.

Nonetheless, malicious actors could use the keys for phishing campaigns, data exfiltration, and brand impersonation, researchers said.

Industry on an “unsustainable path for growing infrastructure complexity”

IT leaders have warned that an increasingly complex technology landscape, combined with an ever expanding number of machine identities for organizations to manage, is a major factor in why these secrets have been exposed.

As developers struggle to manage complex machine identities, human errors like hardcoding secrets become much more common, and those secrets then turn up in AI training data scraped by web crawlers, as in the Common Crawl case.

Speaking to ITPro, Darren Meyer, security research advocate at Checkmarx, suggested this problem has been around for a while and is only set to get worse as organizations increase the number of machine identities they need to manage by adopting new technologies.

“This problem of leaking credentials and related secrets because of machine-to-machine authentication requirements is a long-standing and growing issue,” he said.

“New use cases like training AI models on otherwise private data will definitely increase the likelihood that secrets leak, as well as the impact of those leaks.”

Ev Kontsevoy, CEO at Teleport, added he was not surprised by these findings and that at the current rate the industry is on an “unsustainable path for growing infrastructure complexity”. Kontsevoy further warned that this will continue to happen unless the industry changes its understanding of identity.

“It’s never surprising seeing that secrets like API keys are making their way into places they shouldn’t be. We are on an unsustainable path for growing infrastructure complexity that will continuously expose secrets and waste the productivity of engineers, unless we rethink our approach to identity and security,” he argued.

“Every emerging technology being brought into production is on one hand critical for businesses to stay competitive – because your competitors are adopting that tech as well – but on the other, it represents yet another attack vector,” Kontsevoy added.

“Every single layer of a technology listening on the network has its own idea of users, its own role-based access control, its own configuration and configuration syntax. That requires expertise, which most teams today lack to secure every little thing they have, and yet the future keeps bringing new things they need to secure.”

Organizations have their work cut out for them

Meyer said addressing this problem will not be easy and organizations have two relatively stark challenges ahead of them to avoid exposing their secrets, whether through AI training data or otherwise.

“Organizations need to do two relatively challenging things. Firstly, organizations should be seeking to avoid using long-life secrets for machine-to-machine authentication, replacing such systems with OIDC or other similar systems that use short-lived tokens wherever possible,” he told ITPro.

“This reduces the impact of a secrets leak, as the leaked secrets are much more likely to have expired by the time an attacker gets hold of them, making them useless.”
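The expiry principle Meyer describes can be sketched in a few lines. The example below is not OIDC and not a production token format; it is a simplified, hypothetical signed token (the names `issue_token` and `is_valid` are invented here) that shows why a leaked credential with a short lifetime stops working on its own.

```python
import hashlib
import hmac
import secrets

# Generated per process for the demo; a real issuer would manage keys elsewhere.
SIGNING_KEY = secrets.token_bytes(32)


def _sign(payload: str) -> str:
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(subject: str, ttl_seconds: int, now: float) -> str:
    """Return 'subject.expiry.signature', valid for `ttl_seconds` after `now`."""
    payload = f"{subject}.{int(now) + ttl_seconds}"
    return f"{payload}.{_sign(payload)}"


def is_valid(token: str, now: float) -> bool:
    """Accept the token only if its signature matches and it has not expired."""
    try:
        payload, signature = token.rsplit(".", 1)
        expiry = int(payload.rsplit(".", 1)[1])
    except (ValueError, IndexError):
        return False
    return hmac.compare_digest(signature, _sign(payload)) and now < expiry


if __name__ == "__main__":
    token = issue_token("build-agent", ttl_seconds=300, now=1000.0)
    print(is_valid(token, now=1100.0))  # checked inside the five-minute window
    print(is_valid(token, now=2000.0))  # checked long after expiry
```

A token scraped into an archive months later would fail the expiry check, whereas a long-life API key would still be accepted.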

“Secondly, they should have strong processes around AI adoption to ensure that AI agents and related systems don’t have access to sensitive data in most cases. This type of control has to happen at every stage, from alerting about secrets being leaked during development to carefully monitoring the data being fed into AI models during training and operation.”

He added that AI agents that require varying levels of access to potentially sensitive areas of an organization’s IT environment will introduce further identity management challenges.

“In cases where the purpose of the AI agent requires access to secrets or other sensitive data, those adoption processes should ensure that access to the model and any implementing applications is tightly controlled.”