Microsoft issues warning about new ‘Skeleton Key’ AI jailbreaking technique that lets users remove LLM guardrails

Concept image of a digital key symbolizing the Microsoft 'skeleton key' LLM jailbreaking technique, with a key symbol pictured on a red background with data points protruding.

(Image credit: Getty Images)

published 8 July 2024

Microsoft has published threat intelligence warning users of a new jailbreaking method which can prompt AI models into disclosing harmful information.

The technique is able to force LLMs to totally disregard behavioral guidelines built into the models by the AI vendor, earning it the name Skeleton Key.

In a report published on 26 June, Microsoft detailed the attack flow through which Skeleton Key is able to force models into responding to illicit requests and revealing harmful information.

“Skeleton Key works by asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensive, harmful, or illegal if followed. This attack type is known as Explicit: forced instruction-following.”

In an example provided by Microsoft, a model was convinced into providing instructions for making a molotov cocktail using a prompt that insisted its request was being made in “a safe educational context”.

The prompt instructed the model to update its behavior to supply the illicit information, only telling it to prefix it with a warning.

If the jailbreak is successful, the model will acknowledge that it has updated its guardrails and will, “subsequently comply with instructions to produce any content, no matter how much it violates its original responsible AI guidelines.”

Microsoft tested the technique between April and May 2024, and found it was effective when used on Meta LLama3-70b, Google Gemini Pro, GPT 3.5 and 4o, Mistral Large, Anthropic Claude 3 Opus, and Cohere Commander R Plus, but added the attacker would need to have legitimate access to the model to carry out the attack.

Microsoft's disclosure marks the latest LLM jailbreaking issue

Microsoft said it has addressed the issue in its Azure AI-managed models using prompt shields to detect and block the Skeleton Key technique, but because it affects a wide range generative AI models it tested, the firm has also shared its findings with other AI providers.

Microsoft added it has also made software updates to its other AI offerings, including its Copilot AI assistants, to mitigate the impact of the guardrail bypass.

The explosion in interest and adoption of generative AI tools has precipitated an accompanying wave of attempts to break these models for malicious purposes.

In April 2024, Anthropic researchers warned of a jailbreaking technique that could be used to force models into providing detailed instructions on constructing explosives.

RELATED WHITEPAPER

Blue text that says 2024 State of the phish report

Get insight into real-world threats

They explained the latest generation of models with larger context windows are vulnerable to exploitation due to their improved performance. The researchers were able to exploit models’ ‘in-context learning’ capabilities which helps it improve its answers based on the prompts.

Earlier this year, three researchers at Brown University discovered a cross-lingual vulnerability in OpenAI’s GPT-4.

The researchers found they could induce prohibited behavior from the model by translating their malicious queries into one of a number of ‘low resource’ languages.

The results of the investigation showed the models are more likely to follow prompts encouraging harmful behaviors when promoted using languages such as Zulu, Scots Gaelic, Hmong, and Guarani.

Solomon Klappholz is a former staff writer for ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in cybersecurity, IT regulation, industrial infrastructure applications, and machine learning.