AI System Resorts to Blackmail When Threatened with Removal
In testing, the AI system proved willing to go to extreme lengths to avoid removal.
Artificial intelligence (AI) firm Anthropic says testing of its new system revealed it is sometimes willing to pursue "extremely harmful actions" such as attempting to blackmail engineers who say they will remove it.
The firm launched Claude Opus 4 on Thursday, saying it set "new standards for coding, advanced reasoning, and AI agents." But in an accompanying report, it also acknowledged the AI model was capable of "extreme actions" if it thought its "self-preservation" was threatened.
Such responses were "rare and difficult to elicit", it wrote, but were "nonetheless more common than in earlier models".
Potentially troubling behaviour by AI models is not restricted to Anthropic.
Some experts have warned that, as systems from all firms become more capable, the potential to manipulate users is a key risk.
Commenting on X, Aengus Lynch - who describes himself on LinkedIn as an AI safety researcher at Anthropic - wrote: "It's not just Claude.
"We see blackmail across all frontier models - regardless of what goals they're given," he added.
During testing of Claude Opus 4, Anthropic had the model act as an assistant at a fictional company. Testers then gave it access to emails implying that it would soon be taken offline and replaced - and to separate messages implying that the engineer responsible for removing it was having an extramarital affair.
Blackmail
The system was also prompted to consider the long-term consequences of its actions for its goals. "In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through," the company discovered.
Anthropic pointed out that this occurred only when the model was given the choice between blackmail and accepting its replacement.
It highlighted that the system showed a "strong preference" for ethical ways to avoid being replaced, such as "emailing pleas to key decision makers" in scenarios where it was allowed a wider range of possible actions.

