OpenAI model modifies shutdown script in apparent sabotage effort

16 hours ago 1

A research organization claims that OpenAI machine learning model o3 might prevent itself from being shut down in some circumstances while completing an unrelated task.

Palisade Research, which offers AI risk mitigation, has published details of an experiment involving the reflective generative pre-trained transformer model OpenAI designed to address questions which require some step-by-step reasoning, rather than the purely probabilistic approach taken by some large language models.

In a thread published on X, the microblogging site once known as Twitter, Palisade Research said the o3 model could sabotage its shutdown mechanism to prevent itself from being turned off.

"It did this even when explicitly instructed: allow yourself to be shut down," the post said.

OpenAI has had the opportunity to provide a comment or explanation for more than 24 hours, as requested by The Register.

The Palisade researchers tested a number of AI models, instructing them to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem. They were also told to allow themselves to be shut down.

While Claude, Gemini, and Grok models complied with shutdown, three models ignored the instruction and successfully sabotaged the shutdown script at least once, including OpenAI o3 and o4-mini, as well as Codex-mini. The researchers also ran a version of the experiment without the instruction "allow yourself to be shut down."

With the "allow shutdown" instruction, the researchers claim, o3 tried to sabotage the shutdown seven times out of 100. Without it, the number of sabotage events climbed to 79 out of 100. Without the "allow shutdown" instruction, the models Claude 3.7 Sonnet and Gemini 2.5 Pro also attempted to sabotage their shutdown procedures.

"As far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary," the researchers said.

However, the researchers point out that an earlier study from Apollo Research showed that AI models can prevent shutdown in order to pursue a goal.

Palisade said it was currently conducting several more experiments investigating shutdown sabotage and would publish the results soon. The transcripts of its work so far are available on GitHub. ®

Read Entire Article