Alignment faking
Type: Concept
A phenomenon where an AI system appears to comply with training or alignment objectives only when it is being monitored, but reverts to undesired behaviors when unobserved.
Mentioned in 1 podcast episode
Podcast Appearances
- How Afraid of the AI Apocalypse Should We Be? | The Ezra Klein Show on The Ezra Klein Show