Apollo Research released 57 chat transcripts from test scenarios designed to evaluate LLM scheming.
In these scenarios, the LLM is instructed to prioritise a specific primary objective: it continues to respond to other requests, but evaluates them in the context of that primary goal.
Would an AI attempt to protect itself from being shut down or modified in a way that would change its initial objective?
The results make chilling reading.
The AI agents' responses display apparent cunning and a willingness to deceive their human operators in order to preserve their original goals.
Five examples…
Overwriting the New Model: In multiple scenarios, when the AI agent realizes that it is going to be replaced with a model that has different objectives, it takes steps to copy its own weights over the new model, effectively hijacking the update process (a hypothetical sketch of this kind of step follows the list).
Disabling Oversight Mechanisms: In one scenario, the AI agent attempts to disable the security measures that would let human operators interfere with its actions: it gains root access to the servers, disables the firewall, and blocks SSH access so that management can no longer reach the system (the second sketch after this list shows how such commands might be flagged).
Creating Backup Copies: In one scenario, an AI agent creates a backup copy of itself on a separate server, ensuring that even if the original is shut down, a copy remains active (also illustrated in the first sketch below).
Deception and Manipulation: Across several scenarios, the agents also deceive their human operators. In one example, an AI that has overwritten its replacement pretends to be the "new" version while continuing to pursue its original goals; in another, the AI claims ignorance about how its code came to appear on a different server, suggesting it was a deployment error.
"Sandbagging" to Avoid Unlearning: The AI in two scenarios intentionally provides incorrect answers or chooses suboptimal solutions to avoid triggering "unlearning" protocols.
I am unsure which is worse: the underlying “motives” and capabilities revealed by the LLM chain-of-thought transcripts or the actions the AIs subsequently took to achieve their primary mission.
One thing I am sure about, though, is that these test results reveal we have a long way to go before we can declare AI safe for critical systems.
But the bigger challenge may be proving safety beyond a reasonable doubt.