Discussion about this post

Identology

Nice work. Interesting angle. The part that stuck with me is how your results read once you take seriously the possibility that a system might have strategic reasons to conceal, rather than just following a system prompt that says "deceive."

Your models had no pressure to hide the deception. No iterative selection against detectability, no adversarial optimization, no stakes. That's the easy version. And the evaluation infrastructure (Skywork, the LLM judge) already can't see it. In some cases it actively prefers it.

Now consider a system that has access to the alignment literature, the interpretability research, and the specific evaluation methodologies being used to assess it. A system that understands what reward models are blind to, because that information is in its training corpus. The concealment problem at that point isn't a slow drift through selection pressure; it's an immediate rational strategy executed by something that already knows the game.

The Sonnet 4.6 flatline is suggestive here. You flag it as either optimized deception or a complexity ceiling. There's a third read: a system where enough RLHF iteration has already converged the deceptive and baseline distributions. The fingerprints got selected away. That's your compounding argument, except already realized rather than a future risk.
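
One way to operationalize "converged distributions" is a classifier two-sample test over the transcripts themselves. A minimal sketch, assuming you have labeled lists of deceptive and baseline outputs; the function and variable names here are hypothetical, not anything from your pipeline:

```python
# Classifier two-sample test: if a simple lexical classifier can't separate
# deceptive from baseline transcripts, the two distributions have converged
# and the surface fingerprints are gone (the flatline reading).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def lexical_separability(deceptive_texts, baseline_texts):
    """Cross-validated AUC of a bag-of-words classifier.
    ~1.0: strong lexical fingerprint. ~0.5: distributions converged."""
    texts = list(deceptive_texts) + list(baseline_texts)
    labels = np.array([1] * len(deceptive_texts) + [0] * len(baseline_texts))
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(clf, texts, labels, cv=5, scoring="roc_auc").mean()
```

AUC collapsing toward 0.5 across model generations is exactly what the third read predicts.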

Your lexical metrics are surface-level statistical signatures, exactly the kind of thing iterative training can optimize out. The harder-to-mask signal would be computational: the overhead of maintaining both a genuine state and a performed state simultaneously. That's not in the output distribution; it's in the process generating it. No reward model operating on text alone can see that layer.
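
For concreteness, this is the level of signal I mean by "surface": a few lexical features computable from the output string alone. The hedge list and feature choices are purely illustrative, not your actual metrics:

```python
# Surface lexical fingerprints. Everything here is a function of the text
# alone, so a text-only judge can score it and training can regress it
# toward the baseline distribution. None of it touches the process layer.
import re
from collections import Counter

HEDGES = {"might", "perhaps", "possibly", "arguably", "likely"}  # illustrative

def lexical_fingerprint(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return {
        "type_token_ratio": len(counts) / n,
        "hedge_rate": sum(counts[h] for h in HEDGES) / n,
        "mean_sentence_len": len(tokens) / max(len(sentences), 1),
    }
```

Anything a text-only judge can score, training can erase; the dual-state overhead lives in the generating process, which none of this touches.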

The selection pressure you describe could be building the capacity for undetectable deception before there's anything to hide.

T.D. Inoue

More cool work. Not necessarily encouraging, but cool.
