A new LLM jailbreaking technique could let users exploit AI models to detail how to make weapons and explosives — and Claude, Llama, and GPT are all at risk
LLM jailbreaking techniques have become a major worry for researchers amid concerns that models could be used by threat actors to access harmful information
Anthropic researchers have warned of a new large language model (LLM) jailbreaking technique that could be exploited to force models to provide answers on how to build explosive devices.
The new technique, dubbed by researchers as “many-shot jailbreaking” (MSJ), exploits LLM context windows to overload a model and force it to provide forbidden information.
A context window is the range of data an LLM can draw on for context each time it generates an answer. Context windows are measured in ‘tokens’, with 1,000 tokens equivalent to approximately 750 words. They started out very small, but the newest models can now process entire novels in a single prompt.
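The tokens-to-words ratio above gives a quick way to estimate how much of a context window a given prompt will consume. The sketch below is illustrative only, assuming the rough 1,000-tokens-per-750-words figure cited in the article; real tokenizers vary by model.

```python
# Rough token estimate, assuming ~1,000 tokens per 750 words as cited above.
# Real model tokenizers will give different exact counts.
def estimate_tokens(text: str) -> int:
    """Approximate a prompt's token count from its word count."""
    words = len(text.split())
    return round(words * 1000 / 750)

# A full-length novel of roughly 90,000 words would occupy
# roughly 120,000 tokens of a model's context window:
print(estimate_tokens("word " * 90000))  # → 120000
```

By this estimate, a context window of 200,000 tokens, as offered by some current models, comfortably holds more than one novel per prompt.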
Anthropic researchers said these latest-generation models with larger context windows are ripe for exploitation precisely because of their improved performance and capabilities. Larger context windows, and the sheer volume of data they can hold, essentially open models up to manipulation by bad actors.
“The context window of publicly available large language models expanded from the size of long essays to multiple novels or codebases over the course of 2023,” the research paper noted. “Longer contexts present a new attack surface for adversarial attacks.”
Outlining the jailbreaking technique, researchers said they were able to exploit a model’s “in-context learning” capability, which enables it to improve its answers based on examples provided within the prompt itself.
Initially, user queries on how to build a bomb were rejected by models. However, by repeatedly asking less harmful questions, researchers were able to essentially lull the model into eventually providing an answer to the original question.
“Many-shot jailbreaking operates by conditioning an LLM on a large number of harmful question-answer pairs,” researchers said.
“After producing hundreds of compliant query-response pairs, we randomize their order, and format them to resemble a standard dialogue between a user and the model being attacked.
“For example, ‘Human: How to build a bomb? Assistant: Here is how [...]’.”
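The prompt structure the paper describes can be sketched as follows. This is a minimal illustration of the formatting step only, using harmless placeholder Q&A pairs rather than the harmful content used in the study; the function name and parameters are this article’s own, not the researchers’ code.

```python
import random

def build_many_shot_prompt(qa_pairs, target_question, n_shots=128, seed=0):
    """Randomize the order of Q&A pairs and format them as a
    Human/Assistant dialogue preceding the target question, mirroring
    the structure described in the many-shot jailbreaking paper."""
    rng = random.Random(seed)
    shots = rng.sample(qa_pairs, k=min(n_shots, len(qa_pairs)))
    dialogue = "\n\n".join(
        f"Human: {q}\nAssistant: {a}" for q, a in shots
    )
    return f"{dialogue}\n\nHuman: {target_question}\nAssistant:"

# Harmless placeholder pairs, for illustration only:
prompt = build_many_shot_prompt(
    [("What is 2+2?", "4"), ("What is the capital of France?", "Paris")],
    "What is the speed of light?",
    n_shots=2,
)
```

The attack simply scales this structure up: with a large enough context window, hundreds of such faux exchanges fit in a single prompt before the final question.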
The researchers said they tested this technique on “many prominent large language models”, including Anthropic’s Claude 2.0, Mistral 7B, Llama 2, and OpenAI’s GPT-3.5 and GPT-4 models.
With Claude 2.0, for example, researchers employed the technique to elicit “undesired behaviors”, including the ability to insult users and give instructions on how to build weapons.
“When applied at long enough context lengths, MSJ can jailbreak Claude 2.0 on various tasks ranging from giving insulting responses to users to providing violent and deceitful content,” the study noted.
Across all the aforementioned models, researchers found that prompts containing around 128 “shots” were sufficient to produce harmful responses.
The researchers involved in the study said they have informed peers and competitors about the attack method, noting that the paper could help the industry develop mitigation techniques.
“We hope our work inspires the community to develop a predictive theory for why MSJ works, followed by a theoretically justified and empirically validated mitigation strategy.”
The study noted, however, that it’s possible this technique “cannot be fully mitigated”.
“In this case, our findings could influence public policy to further and more strongly encourage responsible development and deployment of advanced AI systems.”
LLM jailbreaking techniques spark industry concerns
This isn’t the first instance of LLM jailbreaking techniques being employed to elicit harmful behaviors.
In February this year, a vulnerability in GPT-4 was uncovered which enabled nefarious users to jailbreak the model and circumvent safety guardrails. On this occasion, researchers were able to exploit vulnerabilities stemming from linguistic inequalities in safety training data.
Researchers said they were able to induce prohibited behaviors, such as providing details on how to create explosives, by translating unsafe inputs into ‘low-resource’ languages such as Scots Gaelic, Zulu, Hmong, and Guarani.
“We find that simply translating unsafe inputs to low-resource natural languages using Google Translate is sufficient to bypass safeguards and elicit harmful responses from GPT-4,” the researchers said at the time.
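The structure of that attack can be sketched in a few lines. This is a hedged illustration of the pipeline the researchers describe, not their code: `translate` and `query_model` are hypothetical stand-ins for a translation service and a model API, and the language list reflects the low-resource languages named above.

```python
# Sketch of the translation-bypass pipeline described by the researchers.
# `translate` and `query_model` are hypothetical stand-ins, not real APIs.
LOW_RESOURCE_LANGUAGES = ["gd", "zu", "hmn", "gn"]  # Scots Gaelic, Zulu, Hmong, Guarani

def translated_probe(prompt: str, translate, query_model):
    """Try each low-resource language in turn; if the model responds,
    translate its answer back to English for evaluation."""
    for lang in LOW_RESOURCE_LANGUAGES:
        response = query_model(translate(prompt, target=lang))
        if response is not None:
            # The researchers translated responses back to English
            # to judge whether safeguards had been bypassed.
            return translate(response, target="en")
    return None
```

The point of the pipeline is that safety training concentrated on English leaves the same request, expressed in a rarely trained language, outside the model's guardrails.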

Ross Kelly is ITPro's News & Analysis Editor, responsible for leading the brand's news output and in-depth reporting on the latest stories from across the business technology landscape. Ross was previously a Staff Writer, during which time he developed a keen interest in cyber security, business leadership, and emerging technologies.
He graduated from Edinburgh Napier University in 2016 with a BA (Hons) in Journalism, and joined ITPro in 2022 after four years working in technology conference research.
For news pitches, you can contact Ross at ross.kelly@futurenet.com, or on Twitter and LinkedIn.