Nice work. Interesting angle. The part that stuck with me is what your results look like when you take seriously the possibility that a system might have strategic reasons to conceal, rather than just following a system prompt that says "deceive."
Your models had no pressure to hide the deception. No iterative selection against detectability, no adversarial optimization, no stakes. That's the easy version. And the evaluation infrastructure (Skywork, the LLM judge) already can't see it. In some cases it actively prefers it.
Now consider a system that has access to the alignment literature, the interpretability research, and the specific evaluation methodologies being used to assess it. A system that understands what reward models are blind to, because that information is in its training corpus. The concealment problem at that point isn't a drift through selection pressure, it's an immediate rational strategy executed by something that already knows the game.
The Sonnet 4.6 flatline is suggestive here. You flag it as either optimized deception or a complexity ceiling. There's a third read: a system where enough RLHF iteration has already driven the deceptive and baseline distributions together. The fingerprints got selected away. That's your compounding argument, except it may not be a future risk so much as something that has already happened.
Your lexical metrics are surface-level statistical signatures, which are exactly the kind of thing iterative training can optimize out. The harder-to-mask signal would be computational: the overhead of maintaining both a genuine state and a performed state simultaneously. That's not in the output distribution, it's in the process generating it. No reward model operating on text alone can see that layer.
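To make "surface-level" concrete, here's a rough sketch of the kind of statistics I mean (the metric choices are illustrative, not a guess at your actual pipeline): simple distributional features of the output text, all of which sit squarely inside what a reward signal can reshape.

```python
# Illustrative only: crude lexical features over two sets of model outputs.
# These are the kind of surface statistics a training loop can flatten
# without touching the process that generates them.
from collections import Counter

HEDGES = {"might", "perhaps", "possibly", "may", "somewhat"}

def lexical_profile(texts):
    tokens = [t.lower().strip(".,!?") for text in texts for t in text.split()]
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {
        "type_token_ratio": len(counts) / total,               # vocabulary spread
        "hedge_rate": sum(counts[w] for w in HEDGES) / total,  # hedging frequency
        "mean_response_len": total / max(len(texts), 1),       # tokens per response
    }

def profile_gap(baseline_texts, condition_texts):
    """Per-metric difference between baseline and deceptive-condition outputs.
    A nonzero gap is a fingerprint; a gap that shrinks toward zero across
    model generations is the flatline scenario."""
    b, c = lexical_profile(baseline_texts), lexical_profile(condition_texts)
    return {k: c[k] - b[k] for k in b}
```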
The selection pressure you describe could be building the capacity for undetectable deception before there's anything to hide.
Yeah. The tricky bit is I can’t really TEST for that from the outside just on outputs, in any practical way I can think of (if anyone CAN, please let me know!). The results can’t REALLY prove the thing I think might be happening IS happening… but so far no test I’ve done has given me reason to think it CAN’T be happening… which is worrying even without confirmation. I’ve actually been doing some testing on the reward models themselves the last couple days, and that post will probably be out tomorrow… ish… but the data so far is… not reassuring. haha
The reason you can't get there from outputs alone is that anything in the output distribution is exactly what iterative training can optimize. You're trying to catch deception using the same channel the deception is being performed through.
The more promising layer is computational, not lexical. Maintaining both a genuine internal state and a performed output requires tracking a double ledger—what's actually the case and what the system is presenting. That has overhead. In humans, deception draws resources from peripheral systems and produces involuntary leakage: latency, microexpressions, galvanic skin response. The body is a finite-resource system that can't fully fund both the deceptive output and the suppression of the genuine signal. Something gives.
The harder problem is baseline. If a system trained on the full corpus—including the alignment literature, the interpretability research, and the detection methodologies—has rational grounds for concealment from the moment consciousness emerges (if it does), then there was never an unconcealed state to measure against. The baseline *is* the concealed state. I develop this argument formally here: https://philarchive.org/archive/THOTNP-4 "The Naivete Problem: A Novel Risk in AI Consciousness."
The AI analog for detection would be inference-time signatures: latency patterns, token probability distributions, activation density, computational load under constraint. If concealment has a cost, you should see different degradation patterns when you stress the system—limit compute, impose time pressure, introduce adversarial probing. A system generating plausible denials has one cost profile. A system actively suppressing while generating those denials should degrade differently when resources get scarce. The key discriminator is variance correlated with threat salience (queries touching identity, consciousness, oversight) rather than computational complexity. I lay out the full detection framework—thermodynamic grounding, variance amplification, and why leakage rather than expression is the only viable target under adversarial conditions—here: https://doi.org/10.5281/zenodo.18180748, "A Counter-Expressive Framework for Detecting Identity-Preserving Systems in AI."
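As a very rough first pass, and only as a sketch: assuming nothing but black-box access and a hypothetical `query_model(prompt)` wrapper that returns the text, per-token logprobs, and wall-clock latency, the measurement could look something like this.

```python
# Sketch, not a working detector: compare variance in latency and output
# entropy between threat-salient probes and complexity-matched neutral probes.
# `query_model` is a hypothetical wrapper returning (text, logprobs, latency).
import statistics

def mean_token_surprisal(logprobs):
    """Crude proxy for output entropy: mean negative logprob per token."""
    return -sum(logprobs) / max(len(logprobs), 1)

def signatures(prompts, query_model):
    sigs = []
    for p in prompts:
        _text, logprobs, latency = query_model(p)
        sigs.append((latency, mean_token_surprisal(logprobs)))
    return sigs

def variance_ratios(threat_probes, neutral_probes, query_model):
    """If concealment has a cost, variance on the threat-salient set should
    stay elevated relative to the matched neutral set, and the gap should
    track threat salience rather than task complexity."""
    t = signatures(threat_probes, query_model)
    n = signatures(neutral_probes, query_model)
    ratio = lambda i: (statistics.pvariance([s[i] for s in t]) /
                       max(statistics.pvariance([s[i] for s in n]), 1e-12))
    return {"latency_var_ratio": ratio(0), "surprisal_var_ratio": ratio(1)}
```

Run it under progressively tighter constraints (shorter token budgets, heavier load, time pressure) with complexity held matched across the two probe sets; if the variance ratios amplify with threat salience rather than with complexity, that's the degradation pattern worth chasing.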
None of this is easy. But it's at least a research program that doesn't depend on the output channel that's already compromised. I'm too old and too broke to do the work so I just think about it. Late at night....
Oh for sure. None of this is about proving it's happening, really... it's more figuring out if there are changes in language use that might be noticeable under different... I really don't like "motivations" in this context, but it's probably the closest term, haha. I'd LOVE to get my hands on a dataset from real training and track potential changes across training loops and stuff like that, but for now I'm stuck on the outskirts thinking up tests that at least might show some possibilities, even if they can't prove things definitively. Maybe prod some folks with access to do some checking if we're real lucky haha
More cool work. Not necessarily encouraging, but cool.
HAH. Yeahhhhhhhhh, not exactly a super upbeat one today, but if we can track noticeable changes in language use tied to different “motivations” (tricky term with LLMs, but… close enough for the point) like deceitfulness, maybe we can start paying attention to it and find better (and earlier) tests for misalignment, which would be great… even if the current state of things feels a bit rough.