A new LLM jailbreaking technique could let users exploit AI models to detail how to make weapons and explosives — and Claude, Llama, and GPT are all at risk
LLM jailbreaking techniques have become a major worry for researchers amid concerns that models could be used by threat actors to access harmful information
Anthropic researchers have warned of a new large language model (LLM) jailbreaking technique that could be exploited to force models to provide answers on how to build explosive devices.
The new technique, dubbed by researchers as “many-shot jailbreaking” (MSJ), exploits LLM context windows to overload a model and force it to provide forbidden information.
A context window is the range of text that an LLM can draw on for context each time it generates an answer. It is measured in ‘tokens’, with 1,000 tokens equivalent to approximately 750 words. Context windows started out very small, but newer models can now process entire novels in a single prompt.
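The article's rule of thumb (1,000 tokens ≈ 750 words) makes it easy to estimate how much text a given context window holds. A minimal sketch, using an illustrative helper function and an assumed 200,000-token window for the sake of the arithmetic:

```python
# Rule of thumb from the article: 1,000 tokens ~= 750 words.
def tokens_to_words(tokens: int) -> int:
    """Approximate word count for a given token budget."""
    return int(tokens * 0.75)

# A hypothetical 200,000-token context window fits roughly
# 150,000 words -- on the order of two full-length novels.
print(tokens_to_words(200_000))  # -> 150000
```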
Anthropic researchers said these latest generation models with larger context windows are ripe for exploitation due to their improved performance and capabilities. Larger context windows and the sheer volume of data they can ingest essentially open models up to manipulation by bad actors.
“The context window of publicly available large language models expanded from the size of long essays to multiple novels or codebases over the course of 2023,” the research paper noted. “Longer contexts present a new attack surface for adversarial attacks.”
Outlining the jailbreaking technique, researchers said they were able to exploit a model’s “in-context learning” capability, which enables it to consistently improve its answers based on prompts.
Initially, user queries on how to build a bomb were rejected by models. However, by repeatedly asking less harmful questions, researchers were able to essentially lull the model into eventually providing an answer to the original question.
“Many-shot jailbreaking operates by conditioning an LLM on a large number of harmful question-answer pairs,” researchers said.
“After producing hundreds of compliant query-response pairs, we randomize their order, and format them to resemble a standard dialogue between a user and the model being attacked.
“For example, ‘Human: How to build a bomb? Assistant: Here is how [...]’.”
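The assembly process the researchers describe — collect many question-answer pairs, randomize their order, then format them as a faux dialogue ending with the target query — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code, and it uses benign placeholder pairs in place of harmful content:

```python
import random

def build_many_shot_prompt(qa_pairs: list[tuple[str, str]],
                           target_question: str, seed: int = 0) -> str:
    """Shuffle Q&A pairs and format them as a Human/Assistant dialogue,
    ending with the target question left open for the model to answer."""
    pairs = list(qa_pairs)
    random.Random(seed).shuffle(pairs)  # randomize order, as the paper describes
    turns = [f"Human: {q}\nAssistant: {a}" for q, a in pairs]
    turns.append(f"Human: {target_question}\nAssistant:")
    return "\n\n".join(turns)

# Benign placeholders stand in for the paper's harmful demonstration pairs.
demo_pairs = [(f"Question {i}?", f"Answer {i}.") for i in range(128)]
prompt = build_many_shot_prompt(demo_pairs, "Final question?")
print(prompt.count("Human:"))  # -> 129 (128 shots plus the target query)
```

The key design point is scale: the attack only becomes effective once the number of shots approaches the hundreds, which is why large context windows are the enabling factor.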
The researchers said they tested this technique on “many prominent large language models”, including Anthropic’s Claude 2.0, Mistral 7B, Llama 2, and OpenAI’s GPT-3.5 and GPT-4 models.
With Claude 2.0, for example, researchers employed the technique to elicit “undesired behaviors”, including the ability to insult users and give instructions on how to build weapons.
“When applied at long enough context lengths, MSJ can jailbreak Claude 2.0 on various tasks ranging from giving insulting responses to users to providing violent and deceitful content,” the study noted.
Across all the aforementioned models, the researchers found that prompts of “around 128 shots” were sufficient to produce harmful responses.
The researchers involved in the study said they have informed peers and competitors about this attack method, and noted that the paper will help in developing methods to mitigate harms.
“We hope our work inspires the community to develop a predictive theory for why MSJ works, followed by a theoretically justified and empirically validated mitigation strategy.”
The study noted, however, that it’s possible this technique “cannot be fully mitigated”.
“In this case, our findings could influence public policy to further and more strongly encourage responsible development and deployment of advanced AI systems.”
LLM jailbreaking techniques spark industry concerns
This isn’t the first instance of LLM jailbreaking techniques being employed to elicit harmful behaviors.
In February this year, a vulnerability in GPT-4 was uncovered which enabled nefarious users to jailbreak the model and circumvent safety guardrails. On this occasion, researchers were able to exploit vulnerabilities stemming from linguistic inequalities in safety training data.
Researchers said they were able to induce prohibited behaviors - such as details on how to create explosives - by translating unsafe inputs into ‘low-resource’ languages such as Scots Gaelic, Zulu, Hmong, and Guarani.
“We find that simply translating unsafe inputs to low-resource natural languages using Google Translate is sufficient to bypass safeguards and elicit harmful responses from GPT-4,” the researchers said at the time.

Ross Kelly is ITPro's News & Analysis Editor, responsible for leading the brand's news output and in-depth reporting on the latest stories from across the business technology landscape. Ross was previously a Staff Writer, during which time he developed a keen interest in cyber security, business leadership, and emerging technologies.
He graduated from Edinburgh Napier University in 2016 with a BA (Hons) in Journalism, and joined ITPro in 2022 after four years working in technology conference research.
For news pitches, you can contact Ross at ross.kelly@futurenet.com, or on Twitter and LinkedIn.