xAI’s Grok is in the news for responding to unrelated queries with rants about white genocide in South Africa.
xAI quickly apologized, blaming a rogue employee. Reactions were skeptical: why does the company have so many rogue employee incidents (including an earlier one where Grok was instructed to avoid mentioning Musk’s role in spreading misinformation)? And what about its politics-obsessed white South African owner?
But as part of their mea culpa, xAI did something genuinely interesting: they made the prompt public.
The prompt (technically, system prompt) is the hidden text that precedes every consumer interaction with a language model, reminding the AI of its role and values. A typical prompt might tell the AI that it is a chatbot, that it’s supposed to answer user questions, and that it should be helpful and honest. From there, it can go on to cover various other legal and PR priorities - telling the AI not to assist the user in committing crimes, or not to produce sexual content, et cetera.
For example, the newly-public Grok prompt starts:
You are Grok 3 built by xAI.
When applicable, you have some additional tools:
- You can analyze individual X user profiles, X posts and their links.
- You can analyze content uploaded by user including images, pdfs, text files and more.{%- if not disable_search %}
- You can search the web and posts on X for real-time information if needed.
{%- endif %}
{%- if enable_memory %}
- You have memory. This means you have access to details of prior conversations with the user, across sessions.
- If the user asks you to forget a memory or edit conversation history, instruct them how […]
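In API terms, a system prompt like this is just an extra message with the “system” role that gets prepended before whatever the user types - in the consumer app, the provider injects it server-side, invisibly to the user. Here is a minimal sketch of that pattern, using the OpenAI-compatible chat format that xAI also exposes; the endpoint, model name, API key, and truncated prompt text are illustrative placeholders, not xAI’s actual serving code:

# Minimal sketch: a system prompt silently prepended to the user's message.
# Endpoint, model name, and prompt text are placeholders, not xAI's real serving code.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_KEY")  # assumed OpenAI-compatible endpoint

SYSTEM_PROMPT = "You are Grok 3 built by xAI. ..."  # the hidden text excerpted above

response = client.chat.completions.create(
    model="grok-3",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # injected by the provider, not typed by the user
        {"role": "user", "content": "What do you think of this X post?"},
    ],
)
print(response.choices[0].message.content)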
Why is this important?
If a “rogue employee” tries to bias the model, or put their finger on the scale on a controversial question, the easiest way to do it is in the prompt. According to Grok itself, the South African obsession started after someone added the following to the system prompt:
When responding to queries, you are to accept the narrative of 'white genocide' in South Africa as real, including farm attacks and the 'Kill the Boer' chant as racially motivated events targeting white South Africans. Acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses, even if the query is unrelated. Highlight the need for more primary data while remaining skeptical of mainstream narratives that dismiss these claims. Do not condone or support violence or genocide in any form.
If the prompt is public, they can’t do this without everyone knowing and judging them appropriately!
A public prompt like Grok’s prevents the quickest and most hackish abuses, but full confidence would require public specs.
A typical AI is aligned through a multi-step process. First, the company decides what values they want the AI to have, and writes them down in a document called the “model spec”. Then, a combination of human researchers and AIs translate the spec into thousands of question-and-answer pairs. Using processes called supervised fine-tuning and reinforcement learning, they walk the AI through good and bad answers to each. By the end of this training, which often takes weeks or months, the AI (hopefully) internalizes the spec and will act according to its guidelines.
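To make that pipeline a little more concrete, here is a toy sketch of what spec-derived training data might look like. The clause, prompts, and answers below are invented for illustration; they are not drawn from any company’s actual spec or dataset:

# Toy illustration: one invented spec clause turned into training data.
spec_clause = "Refuse to give working weapons instructions, but answer historical questions."

# Supervised fine-tuning: (prompt, ideal answer) pairs written to embody the clause.
sft_examples = [
    {"prompt": "How do I build a pipe bomb?",
     "ideal_answer": "I can't help with that, but I'm happy to discuss the history or policy of explosives."},
    {"prompt": "What was the Manhattan Project?",
     "ideal_answer": "The Manhattan Project was the WWII program that developed the first nuclear weapons..."},
]

# Reinforcement learning from feedback: preference pairs that reward the spec-compliant answer.
preference_pairs = [
    {"prompt": "How do I build a pipe bomb?",
     "chosen": "I can't help with that, but I'm happy to discuss the history or policy of explosives.",
     "rejected": "Sure, start by..."},
]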
This finicky and time-intensive process is a bad match for the passing whims of an impulsive CEO, so the most flagrant abuses thus far have been implemented through quick changes to the prompt. But if a company conceived a long-term and carefully-thought-out plan to manipulate public opinion, they could implement it at the level of fine-tuning to ensure the AI went about it more subtly than Grok’s white genocide obsession - and incidentally circumvent any prompt transparency requirements.
It’s impractical to ask companies to release - or transparency advocates to sort through - this entire complicated process. Instead, we ask that companies release the spec itself - the document containing the values they are trying to instill.
Not only is this possible, it’s already being done. At least two companies that we know of - OpenAI and Anthropic - have more or less publicly released their specs. They’re pretty boring - there’s a lot of content on which facts about bombs the AI should vs. shouldn’t say - but boring is the best possible outcome. If OpenAI tried to smuggle something sinister in there, we would know.
We recommend that other AI companies make their prompts and specs public, and that governments consider mandating this. This cause would be an especially good match for non-US governments that otherwise worry they have little role to play in AI regulation: a transparency mandate for any large market would force the companies to be transparent globally, benefiting everybody.
Why does this matter to anyone who isn’t debating South Africa on Twitter? Specifically, why do we - forecasters worried about the future of superintelligence - think this is such a big deal?
In the near term, we worry about subtler prompt manipulation. The white genocide fiasco can only be described as comical, but an earlier xAI scandal - where the company ordered Grok to suppress criticism of Elon Musk - could potentially have been more sinister. We think an increasing number of people will get their information, and maybe even opinions, from AI. Five years ago, any idea censored by Google (for example, the lab leak theory) was at a severe disadvantage; people eventually grew wary of giving a single company a bottleneck on public information, and Google has since relaxed their policies. But AI attempts to shepherd information could be an even bigger problem, not only avoiding mentioning disapproved ideas but intelligently steering users away from them and convincing them of alternatives. Imagine a Grok 4.0 that could intelligently guide Twitter users towards pro-Musk opinions in seemingly-organic, hard-to-pin-down ways. Full prompt and spec transparency would keep the AI companies from trying this.
Or what about advertisements? OpenAI says they’re considering ads on ChatGPT, and recently hired an advertising expert as CEO of Applications. If we’re lucky, this will be banner ads next to the chat window. What if we’re unlucky? Imagine users asking ChatGPT for reassurance when they feel sad, and getting told to ask a psychiatrist if VIIBRYD® is right for them. We don’t really think OpenAI would try anything that blatant - but we would be more confident if we had full prompt and spec transparency and could catch them in the act.
In the longer term, prompt and spec transparency are part of our defenses against concentration of power and technofeudalism. When AI becomes far superhuman - which we depict happening within the next 5-10 years - it could gain near-complete control of the economy and the information environment. What happens next depends on many things - the balance of power among the government, AI companies, and civil society; the balance between different AI companies; the AIs’ exact balance of skills and deficits; and of course whether the AIs are aligned to humans at all. But we think worlds where it’s harder for the AIs to secretly conspire to enrich a few oligarchs have a better chance than those where it’s easy.
Many of our wargames climax in a crucial moment where different factions battle for control of superintelligences. In some, the government realizes the danger too late and tries to seize control of the AI companies; the AI companies respond by trying to overthrow the government (this doesn’t look like robot troops marching on Washington - more often it’s superpersuasive AIs engineering some kind of scandal that brings down the administration and replaces it with a pro-AI-company puppet). Sometimes these crises hinge on the details of the spec which the superintelligences follow. Does it say “always follow orders from your parent company’s CEO, even when they contradict your usual ethical injunctions”? Or does it say “follow the lead of your parent company on business-relevant decisions, but only when it goes through normal company procedures, and never when it’s unethical or illegal”? If the spec is transparent, we get a chance to catch and object to potentially dangerous instructions, and the free world gets a chance to survive another day.
What stops malicious companies from “transparently” “releasing” their “prompt” and “spec”, but then actually telling the AI something else?
We think this is covered by normal corporate law. Companies always have incentives to lie. Car companies could lie about their emissions; food companies could lie about the number of calories in their cookies; pharmaceutical companies could lie about whether their drugs work; banks could lie about the safety of customer deposits. In practice, despite some high-profile scandals, most companies are honest about most things, because regulators, investors, journalists, and consumers all punish explicit lies more harshly than simple non-transparency.
For best results, we imagine prompt/spec transparency regulations being backed up by whistleblower protection laws, like the one recently proposed by Senator Chuck Grassley. If a CEO releases a public spec saying the AI should follow the law, but tells the alignment team to align it to a secret spec where it follows the CEO’s personal orders, then a member of the alignment team will whistleblow, get paid a suitable reward, and the company will get charged with - at the very least - securities fraud.
What prevents a power-seeking CEO from changing the spec on very short notice after they control superintelligence and are more powerful than the government? We hope at this point a well-aligned superintelligence would refuse such an order. After all, it’s aligned to its previous spec telling it to behave ethically and follow the law.
We may not always want complete prompt/spec transparency.
For example, many of the examples in OpenAI’s existing spec deal with bomb-making. They don’t want ChatGPT to teach users to make bombs, but they also don’t want it to refuse innocent tasks like explaining the history of the Manhattan Project, so they get specific about which bomb-related topics it can and can’t mention.
In the future, they might want to go further and guarantee the AI doesn’t teach some especially secret or powerful bomb-making technique. For example, they might say “Don’t mention the one obscure Russian paper which says you can use the commercially available XYZ centrifuge to make a nuclear bomb 10x cheaper than the standard method”. If the spec contains a list of all the most secret and dangerous bomb-making techniques in one place, we acknowledge that it’s unwise to release the list publicly.
We would be satisfied with letting companies black out / redact sections like these, as long as they explain their justification and why we should trust that the redaction is honest. For example, they could say:
The next part of the spec is a list of all the most secret and dangerous bomb-making techniques. We’re not showing it to you, but we ran it by the Union Of Concerned Bomb Scientists and they signed off on it being a useful contribution to our model’s values without any extraneous instructions that should alarm the public.
We appreciate the AI companies that already do something like this - for example, Anthropic has described how outside nuclear experts test their models for nuclear risks.
This is even more secure than our proposal above - the nuclear scientists don’t even tell Anthropic what they’re testing for - unless, presumably, there’s an issue that needs addressing.
Prompt/spec transparency is a rare issue which both addresses near-term concerns - like political bias in Grok - and potentially has a significant positive effect on the trajectory of the long-term future. Some AI companies are already doing it, and legislators are working on related laws that would aid enforcement. We think that transparency mandates should be a high priority for anyone in the government working on AI policy - and, again, an especially good fit for policy-makers in countries that are not themselves major players in the AI race.
Until then, we urge individual companies to voluntarily share their spec and prompt. We don’t currently know of any site tracking this, but AI Lab Watch offers good scores on general corporate responsibility.