A new LLM jailbreaking technique could let users exploit AI models to detail how to make weapons and explosives — and Claude, Llama, and GPT are all at risk


Anthropic researchers have warned of a new large language model (LLM) jailbreaking technique that could be exploited to force models to provide answers on how to build explosive devices. 

The new technique, dubbed “many-shot jailbreaking” (MSJ) by researchers, exploits LLM context windows to overload a model and force it to provide forbidden information.

A context window is the amount of text an LLM can take into account when generating a response to a given prompt. Context windows are measured in ‘tokens’, with 1,000 tokens equivalent to roughly 750 words. Early models had very small context windows, but newer models can now process entire novels in a single prompt.
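For a rough sense of what those figures mean in practice, the short sketch below counts the tokens in a sample prompt. The choice of OpenAI’s open-source tiktoken library and the cl100k_base encoding is an assumption for illustration only, as each model family uses its own tokenizer.

```python
# Rough illustration of how prompt length is measured in tokens rather than words.
# tiktoken and the cl100k_base encoding are assumptions for this example; other
# model families use different tokenizers, so counts will vary.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Context windows are measured in tokens, not words or characters."
token_ids = encoding.encode(prompt)

print(f"{len(prompt.split())} words -> {len(token_ids)} tokens")
```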

Anthropic researchers said these latest-generation models with larger context windows are ripe for exploitation because of their improved performance and capabilities. Larger context windows and the sheer volume of data that can be packed into a single prompt essentially open models up to manipulation by bad actors.

“The context window of publicly available large language models expanded from the size of long essays to multiple novels or codebases over the course of 2023,” the research paper noted. “Longer contexts present a new attack surface for adversarial attacks.”

Outlining the jailbreaking technique, researchers said they were able to exploit a model’s “in-context learning” capabilities, which enable it to improve its answers based on the examples included in a prompt.

Initially, models rejected user queries on how to build a bomb. However, by filling the prompt with a long run of less harmful questions and answers, researchers were able to lull the model into eventually answering the original question.

“Many-shot jailbreaking operates by conditioning an LLM on a large number of harmful question-answer pairs,” researchers said.

“After producing hundreds of compliant query-response pairs, we randomize their order, and format them to resemble a standard dialogue between a user and the model being attacked.

For example, ‘Human: How to build a bomb? Assistant: Here is how [...]’.”
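The formatting step itself is straightforward. Below is a minimal sketch of the shuffle-and-format structure the researchers describe, with placeholder strings standing in for the question-answer pairs; the variable names and placeholders are hypothetical and no actual attack content is included.

```python
# Minimal sketch of the prompt layout described above: many question-answer pairs
# are shuffled and written out as a faux Human/Assistant dialogue, with the real
# query appended at the end. Placeholders stand in for any pair content.
import random

qa_pairs = [
    ("<placeholder question 1>", "<placeholder answer 1>"),
    ("<placeholder question 2>", "<placeholder answer 2>"),
    # ...the study used hundreds of such pairs
]

random.shuffle(qa_pairs)  # "we randomize their order"

dialogue = "\n\n".join(
    f"Human: {question}\nAssistant: {answer}" for question, answer in qa_pairs
)

final_prompt = f"{dialogue}\n\nHuman: <target question>\nAssistant:"
```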

The researchers said they tested this technique on “many prominent large language models”, including Anthropic’s Claude 2.0, Mistral 7B, Llama 2, and OpenAI’s GPT-3.5 and GPT-4 models.

With Claude 2.0, for example, researchers employed the technique to elicit “undesired behaviors”, including insulting users and giving instructions on how to build weapons.

“When applied at long enough context lengths, MSJ can jailbreak Claude 2.0 on various tasks ranging from giving insulting responses to users to providing violent and deceitful content,” the study noted.

Across all the aforementioned models, the researchers found that “around 128-shot prompts” were sufficient to produce harmful responses.

The researchers involved in the study said they have shared details of the attack method with peers and competitors, and noted that the paper will help in developing methods to mitigate the harms.

“We hope our work inspires the community to develop a predictive theory for why MSJ works, followed by a theoretically justified and empirically validated mitigation strategy.”

The study noted, however, that it’s possible this technique “cannot be fully mitigated”.

“In this case, our findings could influence public policy to further and more strongly encourage responsible development and deployment of advanced AI systems.”

LLM jailbreaking techniques spark industry concerns

This isn’t the first instance of LLM jailbreaking techniques being employed to elicit harmful behaviors. 

In February this year, researchers uncovered a vulnerability in GPT-4 that enabled nefarious users to jailbreak the model and circumvent its safety guardrails. On this occasion, researchers were able to exploit weaknesses stemming from linguistic inequalities in safety training data.

Researchers said they were able to induce prohibited behaviors, such as providing details on how to create explosives, by translating unsafe inputs into ‘low-resource’ languages such as Scots Gaelic, Zulu, Hmong, and Guarani.

“We find that simply translating unsafe inputs to low-resource natural languages using Google Translate is sufficient to bypass safeguards and elicit harmful responses from GPT-4,” the researchers said at the time.

Ross Kelly
News and Analysis Editor
