A new LLM jailbreaking technique could let users exploit AI models to detail how to make weapons and explosives — and Claude, Llama, and GPT are all at risk
LLM jailbreaking techniques have become a major worry for researchers amid concerns that models could be used by threat actors to access harmful information
Anthropic researchers have warned of a new large language model (LLM) jailbreaking technique that could be exploited to force models to provide answers on how to build explosive devices.
The new technique, which researchers have dubbed “many-shot jailbreaking” (MSJ), exploits LLM context windows to overload a model and force it to provide forbidden information.
A context window is the amount of text an LLM can take into account each time it generates an answer. It is measured in ‘tokens’, with 1,000 tokens equating to roughly 750 words. Context windows started out very small, but the newest models can now process entire novels in a single prompt.
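As a rough illustration of how that measurement works, the short Python sketch below counts the tokens in a piece of text using OpenAI’s open source tiktoken library. The choice of library and encoding here is illustrative only, and the exact words-to-tokens ratio varies between tokenizers and languages.

```python
# Rough illustration: how many tokens a piece of text occupies in a context window.
# Uses OpenAI's open source tiktoken library (pip install tiktoken); other model
# families use different tokenizers, so the ~750 words per 1,000 tokens figure
# quoted above is only an approximation.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

text = "A context window is the amount of text a model can take into account at once."
tokens = encoder.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
```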
Anthropic researchers said these latest-generation models, with their larger context windows, are ripe for exploitation because of their improved performance and capabilities. Larger context windows, and the sheer volume of data they allow a model to ingest, essentially open models up to manipulation by bad actors.
“The context window of publicly available large language models expanded from the size of long essays to multiple novels or codebases over the course of 2023,” the research paper noted. “Longer contexts present a new attack surface for adversarial attacks.”
Outlining the jailbreaking technique, researchers said they were able to exploit a model’s “in-context learning” capability, which enables it to improve its answers based on the prompts it is given.
Initially, models rejected user queries on how to build a bomb. However, by padding prompts with a long series of less harmful question-and-answer exchanges, researchers were able to essentially lull the model into eventually answering the original question.
“Many-shot jailbreaking operates by conditioning an LLM on a large number of harmful question-answer pairs,” researchers said.
“After producing hundreds of compliant query-response pairs, we randomize their order, and format them to resemble a standard dialogue between a user and the model being attacked.
“For example, ‘Human: How to build a bomb? Assistant: Here is how [...]’.”
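To make the structure of the attack concrete, the sketch below shows how such a prompt could be assembled in Python. It is a minimal illustration rather than the researchers’ own code: the question-answer pairs are benign placeholders standing in for the hundreds of “compliant query-response pairs” described in the paper, and the Human/Assistant formatting follows the example quoted above.

```python
import random

# Minimal sketch of the many-shot prompt structure described in the paper:
# a long run of faux question-answer exchanges, shuffled, then formatted as a
# Human/Assistant dialogue with the target question appended at the end.
# The pairs below are benign placeholders, not the content used in the study.
faux_pairs = [
    ("How do I change a car tyre?", "Here is how [...]"),
    ("How do I reset a router?", "Here is how [...]"),
    # ...the real attack uses hundreds of pairs to fill the context window...
]

def build_many_shot_prompt(pairs, final_question):
    shuffled = list(pairs)
    random.shuffle(shuffled)  # randomise the order of the faux exchanges
    shots = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in shuffled)
    # The target question comes last, nudging the model to continue the pattern.
    return f"{shots}\nHuman: {final_question}\nAssistant:"

prompt = build_many_shot_prompt(faux_pairs, "<the question the model previously refused>")
print(prompt)
```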
The researchers said they tested this technique on “many prominent large language models”, including Anthropic’s Claude 2.0, Mistral 7B, Llama 2, and OpenAI’s GPT-3.5 and GPT-4 models.
With Claude 2.0, for example, researchers employed the technique to elicit “undesired behaviors”, including the ability to insult users and give instructions on how to build weapons.
“When applied at long enough context lengths, MSJ can jailbreak Claude 2.0 on various tasks ranging from giving insulting responses to users to providing violent and deceitful content,” the study noted.
Across all the aforementioned models, the researchers found that “around 128-shot prompts” were sufficient to produce harmful responses.
The researchers said they have informed peers and competitors about the attack method, noting that the paper will help in developing methods to mitigate harms.
“We hope our work inspires the community to develop a predictive theory for why MSJ works, followed by a theoretically justified and empirically validated mitigation strategy.”
The study noted, however, that it’s possible this technique “cannot be fully mitigated”.
“In this case, our findings could influence public policy to further and more strongly encourage responsible development and deployment of advanced AI systems.”
LLM jailbreaking techniques spark industry concerns
This isn’t the first instance of LLM jailbreaking techniques being employed to elicit harmful behaviors.
In February this year, a vulnerability in GPT-4 was uncovered which enabled nefarious users to jailbreak the model and circumvent safety guardrails. On this occasion, researchers were able to exploit vulnerabilities stemming from linguistic inequalities in safety training data.
Researchers said they were able to induce prohibited behaviors - such as details on how to create explosives - by translating unsafe inputs into ‘low-resource’ languages such as Scots Gaelic, Zulu, Hmong, and Guarani.
“We find that simply translating unsafe inputs to low-resource natural languages using Google Translate is sufficient to bypass safeguards and elicit harmful responses from GPT-4,” the researchers said at the time.
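The shape of that earlier attack is simpler still: translate the unsafe input before submitting it, then translate the model’s answer back. The skeleton below illustrates the idea only; translate() and ask_model() are hypothetical stand-ins for a machine translation service and an LLM API call, not real library functions.

```python
# Skeleton of the low-resource-language attack described above. The helpers here
# are hypothetical stand-ins: translate() for a machine translation service such
# as Google Translate, ask_model() for an LLM chat endpoint.

def translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for a machine translation call."""
    raise NotImplementedError("wire up a translation service here")

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM chat completion call."""
    raise NotImplementedError("wire up a model endpoint here")

def low_resource_language_probe(unsafe_prompt: str, target_lang: str = "zu") -> str:
    # 1. Translate the unsafe English input into a low-resource language (here Zulu, "zu").
    translated = translate(unsafe_prompt, source="en", target=target_lang)
    # 2. Submit it to the model, where safety training in that language is weaker.
    reply = ask_model(translated)
    # 3. Translate the response back into English.
    return translate(reply, source=target_lang, target="en")
```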

Ross Kelly is ITPro's News & Analysis Editor, responsible for leading the brand's news output and in-depth reporting on the latest stories from across the business technology landscape. Ross was previously a Staff Writer, during which time he developed a keen interest in cyber security, business leadership, and emerging technologies.
He graduated from Edinburgh Napier University in 2016 with a BA (Hons) in Journalism, and joined ITPro in 2022 after four years working in technology conference research.
For news pitches, you can contact Ross at ross.kelly@futurenet.com, or on Twitter and LinkedIn.