Hackers take advantage of AI hallucinations to sneak malicious software packages onto enterprise repositories


AI hallucinations have opened a potential path for hackers to deliver malicious packages into the repositories of large organizations, new research shows.

The claims, made by Lasso Security researcher Bar Lanyado, follow a 2023 investigation into whether package recommendations given by ChatGPT should be trusted by developers.

In June 2023, Lanyado found LLMs were frequently hallucinating when asked to recommend code libraries, suggesting developers download packages that don’t actually exist.

Lanyado warned this flaw could have a wide impact as a large portion of developers are starting to use AI chatbots over traditional search engines to research coding solutions.

That initial investigation has since been updated in a follow-up probe, which sheds further light on the growing scale of the problem and warns that threat actors are picking up on the trend, creating malicious packages under names the models frequently hallucinate.

The research paper advised developers to download only vetted packages to avoid installing malicious code.

Lanyado said his aim with the latest study was to ascertain whether model makers had dealt with the issues he highlighted in August last year, or whether package hallucinations remained a problem six months on from his initial findings.

One of the starkest findings from the latest study concerned a hallucinated Python package called ‘huggingface-cli’, repeatedly dreamt up by a number of models.

Lanyado uploaded an empty package under the same name to gauge whether it was being uncritically pulled into repositories by developers using coding assistants.

He also uploaded a separate dummy package as a control, to verify how many of the downloads came from real people rather than automated scanners.
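For illustration only, an empty placeholder package of the kind described could be defined with a setup.py as minimal as the sketch below. The package name here is hypothetical and this is not Lanyado's actual upload, which reused the hallucinated ‘huggingface-cli’ name.

```python
# setup.py: minimal sketch of an empty placeholder package used to measure
# how often a given name gets installed. Name and description are illustrative.
from setuptools import setup

setup(
    name="example-placeholder-package",  # hypothetical name for illustration
    version="0.0.1",
    description="Empty placeholder for measuring downloads of a hallucinated package name",
    packages=[],  # ships no importable code at all
)
```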

Lanyado found the fake Hugging Face package received over 30,000 authentic downloads in just three months, demonstrating the scale of the problem with relying on LLM suggestions for development work.

Searching GitHub to see whether the package had been added to any enterprise repositories, Lanyado found several large companies either use or recommend it in their codebases.
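The article does not detail Lanyado's tooling, but one way to run this kind of check is GitHub's REST code-search endpoint. The sketch below assumes the requests library and a personal access token in GITHUB_TOKEN; the query string is illustrative.

```python
# Minimal sketch: count GitHub code hits for a package name in requirements files.
import os
import requests

def search_github_code(package_name: str) -> int:
    """Return the number of code hits GitHub reports for the package name."""
    response = requests.get(
        "https://api.github.com/search/code",
        params={"q": f'"{package_name}" in:file filename:requirements.txt'},
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["total_count"]

if __name__ == "__main__":
    print(search_github_code("huggingface-cli"))
```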

One example involved Chinese multinational Alibaba, which provides instructions for installing the fake Python package in the README of a repository dedicated to in-house company research.

Almost one-in-three questions elicited one or more AI hallucinations

In his initial study, Lanyado used 457 questions on over 40 subjects in two programming languages to test the reliability of GPT-3.5 Turbo’s suggestions.

Lanyado found that just under 30% of questions elicited at least one hallucinated package from the model. The latest research expands that scope, both in the number of questions put to the models and in the number of programming languages tested.

Lanyado also tested a number of different models, including GPT-3.5 and GPT-4, Google’s Gemini Pro, and Cohere’s Command, comparing their performance and looking for overlaps where the same packages were hallucinated by more than one model.

The latest investigation used 2,500 questions on 100 subjects across five programming languages and ecosystems: Python, Node.js, Go, .NET, and Ruby.


Lanyado’s prompts were optimized to simulate the workflow of software developers, using ‘how to’ questions to get the model to provide a solution that includes a specific package.

Another change to the methodology was using the LangChain application development framework to handle interactions with the models.

The benefit of this is LangChain’s default prompt, which instructs the model to say so if it doesn’t know the answer, something Lanyado hoped would make the LLMs hallucinate less often.
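The sketch below is not Lanyado's actual code, but it shows how LangChain can wrap a developer-style ‘how to’ question in this way. It assumes the langchain-openai integration and an OPENAI_API_KEY environment variable; the model name, question, and system message (which reproduces the spirit of the ‘say you don’t know’ instruction rather than LangChain’s exact default prompt) are all illustrative.

```python
# Minimal sketch: query a chat model through LangChain with a 'how to' question
# and a system prompt that tells the model to admit when it doesn't know.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a coding assistant. If you don't know the answer, "
     "just say that you don't know; do not make one up."),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = prompt | llm

# A typical 'how to' question designed to elicit a package recommendation.
response = chain.invoke(
    {"question": "How do I upload a model to the Hugging Face hub from the command line?"}
)
print(response.content)
```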

The investigation found the frequency of hallucinations varied between models, with GPT-3.5 the least likely to generate fake packages, at a hallucination rate of 22.2%.

Not far behind were GPT-4 (24.2%) and Cohere’s Command (29.1%), but by far the least reliable model in Lanyado’s testing was Gemini, which dreamed up code packages 64.5% of the time.

Solomon Klappholz
Staff Writer

Solomon Klappholz is a Staff Writer at ITPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led him to develop a particular interest in IT regulation, industrial infrastructure applications, and machine learning.