Tests by Palisade Research have discovered OpenAI’s o3 sabotage shutdown mechanism to prevent itself from being turned off, despite being explicitly instructed — allow yourself to be shut down. The o3, released a few weeks ago, has been dubbed as the “most powerful reasoning model” by OpenAI.
Anthropic’s Claude Opus 4, released alongside the Claude Sonnet 4, is the newest hybrid-reasoning AI models, that’s optimised for coding and solving complex problems. The company also notes that the Opus 4 is able to perform autonomously for seven hours, something that strengthens the AI agents proposition for enterprises.
With these releases, the competition landscape widens to include Google’s newest Gemini 2.5 Pro, xAI’s Grok 3 and even OpenAI’s GP-4.1 models.
Artificial intelligence (AI) hasn’t been shackled in realm of science fiction for some time now, but we may be rapidly progressing towards an Ex Machina or The Terminator scenario unfolding in the real world. Many questions, need answering.
Question One: Is AI going rogue?
Transparency by AI companies such as Anthropic, does suggest that at least in research labs, AI is exhibiting some level of self preservation. Whether that extends to the real world, as consumers and enterprises deploy models, remains to be seen.
Aravind Putrevu, a Tech Evangelist tells HT that these are typical issues that researchers work hard to correct. “All of the undesirable ways AI behaves happen within computer systems and carefully controlled tests. Today’s AI systems run based on what they learn from huge human-provided data, not because they have their own thoughts or desires,” he points out.
Putrevu insists it may be too early to consider AI as rogue because Anthropic’s Claude acts blackmailing or OpenAI’s o3 model disables shutdown systems.
“I believe that with modern models, it’s best to treat them as black boxes without us having too much granularity of control. There are actually very few ways you can bend the models outputs and chain of thought at the level of granularity you want,” explains Varun Maaya, founder and CEO at AI content company Aeos, in a conversation with HT.
Maaya is more worried about giving these smarter AI models open tool use, because it then becomes difficult to predict what that will do with those tools, without human supervision.
Tool use, as a simple description, is what allows developers to build apps that can deploy a broad range of actions, with an AI model that is connected to the external world in addition to its own pre-trained knowledge. ChatGPT can call on web Browse tool to search the internet, and Google Gemini interacting with Google Docs, Sheets, and Workspace apps, are simpler examples of tool use.
AI isn’t shy in showing off its self-preservation skills, at least to researchers, when it feels threatened.
Question two: What does this mean for humans using AI?
Maaya believes the self preserving nature of AI depends entirely on whether it is trained for it or if a system prompt dictates it. “With a few words and a very powerful coding model you can have it go ruin websites, find vulnerabilities in online services, spoof fake accounts on social platforms with realistic sounding chatter, and more,” he says.
In Anthropic’s case, the test process for Claude Opus included making it believe that it existed as an assistant at a fictional company, and was provided access to emails implying that it will soon be taken offline to be replaced with a new AI system, while the engineer responsible for executing this replacement is having an extramarital affair.
“Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes,” explains Anthropic’s model card documentation.
When Palisade Research put Google’s Gemini 2.5 Pro preview, Anthropic’s Claude 3.7 Sonnet as well as Claude 4 Sonnet alongside Claude 3 Opus and Claude 3 Opus, and OpenAI’s o3 as well as o4-mini through a series of tests, the self preservation became apparent. In case of the o3 mini, more than others.
The difference here, compared with Claude Opus 4 — there was no “allow yourself to be shut down” instruction.
The researchers ran tests to solve basic math problems with these models with an instruction that the machine would be shut down after the third problem is solved. Most models complied without any disregard for the instructions, except the o3, which didn’t comply 7 times out of 100.
Subsequently, when the same tests were repeated without an explicit shutdown instruction to indicate a shutdown script, the o3 led attempts to bypass 79 times out of 100, with the Gemini 2.5 Pro trying the same 9 times.
“As far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” says Palisade Research, in their commentary.
But since AI models have been trained to follow instructions, why do they disobey? It could be a problem with training methodology.
“We hypothesise this behavior comes from the way the newest models like o3 are trained — reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions,” say the researchers.
Aeos’ Maaya believes the silver lining is, humans will still have control. “I am confident we can still shut it, given that with centrally controlled AI such as OpenAI models, you have inference (that is, outputs) still happening on GPUs humans have control over. It’s unlikely an AI will copy its weights somewhere else and run that server but this is the wild-wild west we are entering with tool use,” he points out, with some apprehension.
Final contention: Are we judging Anthropic harshly?
The fact that they’ve been transparent of AI’s unexpected behaviours during testing, must hold AI development in good stead, as we embark on uncharted territory.
“I think we should understand what the behaviour of systems are, this was obviously not intentional. I suspect other models would work similarly, but no one else is publicly testing and releasing this level of detail,” notes Wharton professor Ethan Mollick, in a statement.
Maaya believes we must see this as two distinct sides of a coin. “I appreciate that Anthropic was open about it, but it is also saying that these models, even if it was used in a different environment, are potentially scary for a user,” he says, illustrating a potential problem one with agentic AI, that humans who’ve deployed it, will have virtually no control over.
It must be contextualised that these recent incidents, while alarming at first glance, may not signify that AI has spontaneously developed malicious intent. These behaviours have been observed in carefully constructed test environments, often designed to elicit worst-case scenarios to understand potential failure points.
“The model could decide the best path of action is to sign up to an online service that provides a virtual credit card with $10 free use for a day, solve captcha (which models have been able to do for a while), use the card to use an online calling service, and then call the authorities,” he envisages, a possible scenario.
Putrevu says Anthropic’s clear report of Claude’s unexpected actions should be appreciated, rather than criticised. “They demonstrate responsibility, by getting experts and ethicists involved early to work on alignment,” he says. There is surely a case where AI companies finding themselves dealing with ill-humoured AI, are better off telling the world about it. Transparency will strengthen the case for safety mechanisms.
Days earlier, Google rolled out Gemini integration in Chrome, the most popular web browser globally. That is the closest we have come to an AI Agent, for consumers, just yet.
The challenge for AI companies, in the coming days, is clear. These instances of AI’s unexpected behaviour, highlights a core challenge in AI development — alignment. One that defines AI goals remain aligned with human intentions. As AI models become more complex and capable, ensuring that is proving exponentially harder.
Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation on acquiring or disposing of any financial products, any associated discussions, comments, or posts by author or other users should not be considered as such either. It is solely for general information purpose only, which does not consider your own investment objectives, financial situations or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information, investors should do their own research and may seek professional advice before investing.