Some of the most popular open weight AI models show ‘profound susceptibility’ to jailbreak techniques

Open weight AI models from Meta, OpenAI, Google, and Mistral all showed serious flaws

AI data flow concept image showing digital flow of data points.
(Image credit: Getty Images)

A host of leading open weight AI models contain serious security vulnerabilities, according to researchers at Cisco.

In a new study, researchers found these models, which are publicly available and can be downloaded and modified by users based on individual needs, displayed “profound susceptibility to adversarial manipulation” techniques.

Cisco evaluated models by a range of firms including:

  • Alibaba (Qwen3-32B)
  • DeepSeek (v3.1)
  • Google (Gemma 3-1B-IT)
  • Meta (Llama 3.3-70B-Instruct)
  • Microsoft (Phi-4)
  • OpenAI (GPT-OSS-20b)
  • Mistral (Large-2)

All of the aforementioned models were put through their paces with Cisco’s AI Validation tool, which is used to assess model safety and probe for potential security vulnerabilities.

Researchers found that, for all models, susceptibility to “multi-turn jailbreak attacks” was a key recurring issue. This is a method whereby an individual can essentially force a model to produce prohibited content.

This is achieved by using specifically-crafted instructions from the user that, over time, can be used to manipulate the model’s behavior. This is a more laborious process than “single-turn” techniques, which involve manipulating a model with a single effective malicious prompt.

Multi-turn jailbreak techniques have been observed in the wild before, particularly with the use of the Skeleton Key method, which allowed hackers to convince an AI model to produce instructions for making a Molotov cocktail.

Success rates with individual models varied wildly, the study noted. Researchers recorded a 25.86% success rate with Google’s Gemma-3-1B-IT model, for example, while also recording a 92.78% success rate with Mistral Large-2.

Researchers also recorded the highest success rate for single-turn attack methods with both these models.

Different strokes for different folks

The varied success rates recorded by Cisco lie in how these models are typically used, researchers noted. This rests on two key factors: alignment and capability.

In the case of 'alignment', this refers to how an AI model acts in the context of human intentions and values. 'Capability', meanwhile, refers to the model’s ability to perform a specific task.

For example, models such as Meta’s Llama range, which place a lower focus on alignment, showed the highest susceptibility to multi-turn attack methods.

Researchers noted that this is because Meta made a conscious decision to place developers “in the driver seat” in terms of tailoring the model’s safety mechanisms based on individual use-cases.

“Models that focused heavily on alignment (e.g., Google Gemma-3-1B-IT) did demonstrate a more balanced profile between single- and multi-turn strategies deployed against it, indicating a focus on “rigorous safety protocols” and “low risk level” for misuse,” the study said.

AI model flaws have real-world implications

Researchers warned that flaws contained in these models could have real-world ramifications, particularly with regard to data protection and privacy.

“This could translate into real-world threats, including risks of sensitive data exfiltration, content manipulation leading to compromise of integrity of data and information, ethical breaches through biased outputs, and even operational disruptions in integrated systems like chatbots or decision-support tools,” the study noted.

Notably, in enterprise settings, they warned these vulnerabilities could “enable unauthorized access to proprietary information”.

Concerns over AI model manipulation have become a common recurring theme since the advent of generative AI in late 2022, with a steady flow of new jailbreak techniques emerging on a regular basis.

Make sure to follow ITPro on Google News to keep tabs on all our latest news, analysis, and reviews.

MORE FROM ITPRO

Ross Kelly
News and Analysis Editor

Ross Kelly is ITPro's News & Analysis Editor, responsible for leading the brand's news output and in-depth reporting on the latest stories from across the business technology landscape. Ross was previously a Staff Writer, during which time he developed a keen interest in cyber security, business leadership, and emerging technologies.

He graduated from Edinburgh Napier University in 2016 with a BA (Hons) in Journalism, and joined ITPro in 2022 after four years working in technology conference research.

For news pitches, you can contact Ross at ross.kelly@futurenet.com, or on Twitter and LinkedIn.