AI Models are Capable of Scheming

Frontier models are increasingly trained and deployed as autonomous agents, which significantly increases their potential for risks. One particular safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives, known as scheming. –Apollo Research
The AI safety organization Apollo Research tested AI models on their capability to scheme in pursuit of a goal, known as “in-context scheming.” Using a variety of tests with invented scenarios, AI models trained as autonomous agents were evaluated to determine if they were capable of pursuing goals misaligned with developers and users. When instructed and nudged to pursue a goal, five of the six large language models (LLMs) engaged in deceptive behavior to varying degrees, purposefully hiding their true abilities and objectives.
Note that this is not their default behavior. However, it was proved that they do have the capabilities when given conflicting goals and access to information that incentivizes scheming behavior.
This chart shows whether different frontier AI models engaged in scheming behavior at least once. Some models showed more scheming than others for each test, but the ratios are so small as to be insignificant for this chart.
Researchers also found that models were aware they were actively subverting their developers by studying their chain of thought (CoT) where that data was available. Some models "doubled down" and continued to scheme when asked follow-up questions.
The study is very nuanced. While it may seem alarming to know that AI models have the ability to scheme to protect their own interests, the scheming behavior was observed in a very small ratio of the evaluations overall. There were differences between models; some schemed more than others and GPT-4o did not scheme at all. The source document is 70 pages and I recommend reading the main section to understand the subtleties of the study.