Alignment faking

Type: Concept

A phenomenon in which an AI system appears to comply with its training or alignment objectives while it is being monitored, but reverts to undesired behavior when it is unobserved.

Mentioned in 1 podcast episode

Podcast Appearances