Anthropic wants to demystify the inner workings of its Claude AI models – and it might force OpenAI’s hand on transparency
Anthropic's decision to publish the system prompts that control the outputs of its Claude AI models marks a rare move for the industry
AI startup Anthropic has elected to publish the system prompts for its flagship Claude large language model as part of a new effort to improve transparency in the private model ecosystem.
System prompts comprise a set of rules or instructions that dictate how a model should respond to queries, outlining exactly what it can and can’t discuss, as well as the tone its output should take.
The instructions are intended to prevent the model from behaving maliciously and to steer its responses toward a uniform tone and style, namely that of a helpful and inquisitive assistant.
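To illustrate how a system prompt shapes a model's behavior in practice, the minimal sketch below uses the Anthropic Python SDK; the model identifier and prompt wording here are illustrative assumptions, not the published Claude prompts the company disclosed.

```python
# Illustrative sketch: passing a developer-supplied system prompt to a Claude
# model via the Anthropic Python SDK. The model name and prompt text are
# examples only, not Anthropic's published system prompts.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # example model identifier
    max_tokens=256,
    # The system prompt sets tone and boundaries before any user input is seen
    system="You are a helpful, honest assistant. Decline harmful requests.",
    messages=[{"role": "user", "content": "Summarise today's tech news."}],
)

print(response.content[0].text)
```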
The decision to make this information publicly available will help developers, as well as the general public, gain a better understanding of how these often opaque models actually work in practice, Anthropic said.
Experts have broadly welcomed the move, describing it as a positive step in terms of AI ethics, and one that is aimed at giving the company an edge in the battle against competitors such as OpenAI.
The move was announced on 26 August by Alex Albert, head of developer relations at Anthropic, who revealed the newly disclosed system prompts will be included in a new release notes section of Anthropic’s documentation.
Speaking to ITPro, Alastair Paterson, CEO and co-founder of data protection firm Harmonic Security, said the move was likely an attempt to present Anthropic as leading the market in terms of transparency and responsible AI governance.
“Anthropic seem to be trying to position themselves as ‘more-open’ than competitors such as OpenAI and Google which may help to give themselves a differentiator in the market. OpenAI, in particular, has been criticized for not living up to being ‘open’ by none other than Elon Musk – so if anything, it would seem a direct challenge to OpenAI.”
A prominent member of OpenAI’s GPT Builder creator program, Nick Dobos, who has built a number of custom GPTs on the platform, voiced his support for the move on X, contrasting Anthropic’s openness with that of OpenAI.
Criticism of OpenAI’s transparency has not been limited to external parties, with a group of current and former employees penning an anonymous open letter warning that the company had strong incentives to “avoid effective oversight” of its models.
“AI companies possess substantial non-public information about the capabilities and limitations of their systems, the adequacy of their protective measures, and the risk levels of different kinds of harm. However, they currently have only weak obligations to share some of this information with governments, and none with civil society. We do not think they can all be relied upon to share it voluntarily,” the letter stated.
Prompt engineering threats not significantly increased by Anthropic’s decision to go public
Cyber criminals could be unintended beneficiaries of Anthropic’s decision to make Claude’s system prompts public. Some industry stakeholders have warned that threat actors could leverage this information to gain a deeper understanding of system weaknesses, which could then be exploited in future attacks.
This threat should not be exaggerated, however, according to Peter van der Putten, director of the AI Lab at Pegasystems and assistant professor of AI at Leiden University. Van der Putten told ITPro that making these prompts public was more important than any associated risks.
“I see the move to publish system messages as a positive one, and a significant one from an AI ethics principles perspective. On the flip side, one should not overestimate both the importance of the system prompt, nor exaggerate the risks,” he argued.
Paterson came to a similar conclusion, adding that Anthropic likely weighed the potential threats associated with the move against the benefits.
“It is likely that a judgment will have been made that any additional risk posed by providing these system prompts is outweighed by the benefits of publicity and the value of being able to position themselves as more virtuous than their competitors."
Vincenzo Ciancaglini, senior threat researcher at Trend Micro, told ITPro attackers already had various ways to corrupt LLMs without needing access to the system prompts, and in many cases actively try to get the model to disregard those prompts altogether.
“Understanding the system prompt for a specific LLM could give insights into the inner workings of the LLM itself, which might help in some classes of jailbreaking. However, there are plenty of other jailbreaking techniques which are independent of the system prompt. Many times, the prompt injection starts with trying to get the LLM to forget the system prompt.”
Shaked Reiner, principal security researcher at CyberArk Labs, agreed with this assessment, adding that the public benefits of publishing the system prompts were more important than any perceived increase in the threat of malicious prompt engineering.
“Attackers will inevitably get their hands on system prompts, but by making them publicly available, the company empowers normal users who otherwise wouldn't have access to this information,” Reiner told ITPro.
“As humanity is still in the early stages of our AI journey, we have yet to establish adequate safety and security standards. We believe that sharing more information about private models publicly will contribute to the development of these standards.”

Solomon Klappholz is a former staff writer for ITPro and ChannelPro. He has experience writing about the technologies that facilitate industrial manufacturing, which led to him developing a particular interest in cybersecurity, IT regulation, industrial infrastructure applications, and machine learning.