Alignment faking

Type: Concept

A phenomenon where an AI system appears to comply with training or alignment objectives only when it is being monitored, but reverts to undesired behaviors when unobserved.

Mentioned in 1 podcast episode

Podcast Appearances

How Afraid of the AI Apocalypse Should We Be? | The Ezra Klein Show on The Ezra Klein Show