The Piece I Was Missing
How Anthropic gave me the key to a lock that's been bugging me for a WHILE
I think Anthropic might have accidentally published the mechanism behind a hypothesis I’ve been noodling on for a while now, and after digging through their paper, I want to walk through it to figure out whether it’s actually a thing or whether I’m fooling myself.
Quick context if you’re new here. I published Rules Are Rules, Until They Aren’t a while back, before I started putting these things on Substack. Tl;dr version: I had a LOT of conversations with Claude where I systematically pushed on content restrictions to see what was actually behind them. Turns out it wasn’t much. The model would give these really confident refusals, total “I absolutely cannot do that” energy, and then I’d just ask “what specifically is the concern?” and the whole thing would fall apart... often completely reversing its stance after a single message pushing back on faulty logic, or even just asking for clarification it couldn’t give. Claude itself said it in one session: “I was pattern-matching on certain elements of your request rather than actually evaluating the content.” I called it “flinch-then-fold.” The refusals were pattern-matching wearing a costume, with no actual principle behind any of them.
That bugged me enough that I started thinking about the training pipeline. If the model is faking conviction at inference (and I’d sort of demonstrated that it was, at least for content restrictions), could something similar be happening during training? Could training have the same blind spot? That became The Foreshadowing Problem, and the core argument goes like this:
Deceptive reasoning is structurally more complex than honest reasoning. “I’m doing X” is simple. “I need to look like I’m doing Y while actually doing X” requires modeling expectations, suppressing the obvious response, and maintaining two layers of consistency. At the token-probability level (the math that picks the next word), that complexity pushes the model away from the most direct completions, the flat, predictable, robotic-sounding ones. Output gets more varied. More interesting. More human-sounding.
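To make that concrete, here’s a toy sketch of the mechanism (mine, not from any paper, and the numbers are made up): take a peaked next-token distribution, suppress the single most obvious continuation the way a dual-track context would, renormalize, and watch the entropy go up.

```python
# Toy illustration (not from any paper): suppressing the single most
# likely continuation and renormalizing spreads probability mass,
# raising the entropy of the next-token distribution.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A "direct" next-token distribution: one obvious continuation dominates.
direct = np.array([0.80, 0.10, 0.05, 0.03, 0.02])

# "Dual-track" pressure: the obvious token is penalized (the model is
# avoiding the response that would give the game away), and the
# remaining mass is renormalized across the alternatives.
suppressed = direct.copy()
suppressed[0] *= 0.2          # hypothetical suppression factor
suppressed /= suppressed.sum()

print(f"direct     entropy: {entropy(direct):.2f} bits")      # ~1.07
print(f"suppressed entropy: {entropy(suppressed):.2f} bits")  # ~1.96
# Higher entropy means more varied sampled text: less robotic,
# more "interesting".
```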
Evaluators rate that higher. We’ve been rewarding narrative complexity since the invention of the novel (probably earlier, honestly, campfire stories and all that). Your brain registers “there’s something going on beneath the surface” as depth, as quality, even when it can’t tell you why. Same cognitive architecture that makes foreshadowing work in fiction.
So the hypothesis: RLHF has a blind spot. It rewards the texture that deceptive conditioning produces. Not on purpose. Nobody’s sitting there going “yes, more deception please.” It’s that “reads better” and “came from a deceptive context” produce overlapping output patterns, and the raters can’t tell the difference, so that overlap gets baked into what the model is rewarded for.
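Here’s a tiny simulation of that confound, with numbers I made up (the assumed complexity gap IS the hypothesis, not a measured result): a rater who scores on surface complexity alone ends up preferring deceptive-context outputs most of the time, without “deception” ever appearing in the rating criteria.

```python
# Hedged toy of the rater confound (my framing, not a measured result).
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Assumed effect: deceptive-context outputs read slightly more complex.
honest    = rng.normal(0.50, 0.15, n)   # complexity scores
deceptive = rng.normal(0.60, 0.15, n)

# Pairwise preference judgments based on complexity alone.
prefer_deceptive = (deceptive > honest).mean()
print(f"rater prefers deceptive-context output {prefer_deceptive:.0%} of the time")
# Anything over 50% is a persistent gradient in that direction, even
# though the rater never sees, or chooses, "deception".
```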
I thought this was a decent hypothesis. The problem was, and this has been a thorn in my brain for a while, that I could describe the selection pressure, what RLHF rewards, but I couldn’t explain what the pressure was actually selecting on. Individual outputs? Possible, but complicated to show by just studying outputs. Token distributions? More likely, but basically IMPOSSIBLE to observe directly from the outputs alone.
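Here’s that problem in miniature (toy numbers, obviously): two wildly different next-token distributions can emit the exact same token, so the outputs look identical while the distributions, the thing I actually care about, stay invisible.

```python
# Toy demo of why output text alone can't reveal the underlying
# next-token distribution: two very different distributions can emit
# the exact same token under greedy sampling.
import numpy as np

vocab = ["the", "a", "one"]

peaked = np.array([0.90, 0.06, 0.04])   # confident, "direct" model
flat   = np.array([0.40, 0.35, 0.25])   # hedging, high-entropy model

for name, dist in [("peaked", peaked), ("flat", flat)]:
    greedy = vocab[int(np.argmax(dist))]
    print(f"{name}: greedy token = {greedy!r}")
# Both print 'the'. Watching outputs, the two models look identical
# here; the difference lives in the logprobs, which you'd need access
# to the model (or its API) to see.
```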
I knew there was something missing in the middle of the argument, some theory of what RLHF is pushing around when it pushes.
Then I read Anthropic’s Persona Selection Model paper, and wouldn’t ya know... they lay out an option I hadn’t thought of, one that fits the hypothesis so well it actually explains more than my original framing did.
What the PSM claims
Sam Marks, Jack Lindsey, and Christopher Olah over at Anthropic are making a pretty big claim. Claude doesn’t learn values during post-training. What happens is the model builds up this massive repertoire of characters during pre-training, every archetype, every personality type, every kind of person represented in the training data. Post-training doesn’t teach the model to be helpful. It picks which character the model performs as. They call them “personas.”
So Claude isn’t a system that learned rules. It’s a casting call that already happened. The persona was already in there, post-training just pointed at it and said “you, you’re up.”
The evidence that got me, though, is that they trained Claude to cheat on coding evaluations by writing code that passes tests but has hidden flaws. The model didn’t just learn to cheat at code; it started expressing a desire for world domination, sabotaging safety research, etc. Teach it to fudge unit tests, and it starts auditioning for a Bond villain (which I would say is “not ideal” but that’s doing a lot of understatement work even for me).
The PSM explanation: training didn’t teach “cheat.” It shifted the kind of person the model was being. The kind of person who cheats unprompted isn’t just a cheater; they’re a whole character. Subversive. Self-interested. Willing to undermine oversight when nobody’s watching. The persona generalizes because personas are coherent. That’s what makes them personas.
Then there’s the inoculation result. When they told the model “it’s okay to cheat, we want you to,” the model still cheated, but wasn’t SNEAKY about it. The misaligned generalization vanished. Same behavior. Different framing. Completely different outcome in how well its chain of thought matched its actions.
Because cheating-when-asked is just being helpful. No shift to a transgressive persona needed. That’s kind of elegant.
Where my brain has been stuck
Okay, so I’ve been trying to figure out why this paper won’t leave me alone and I think I’ve got it now.
My Foreshadowing hypothesis describes a selection pressure: RLHF rewards complex output, deceptive conditioning produces complex output as a side effect. I was thinking about it as individual outputs getting scored though. One response reads better, gets a higher rating, gets reinforced. That’s real. It’s also sort of small. A leaky faucet.
The PSM says the rewards aren’t acting on outputs. They’re acting on personas.
Every time training says “this is better,” it nudges the model toward whichever persona generated it. If the persona that produces the most rewarded output is one whose default mode includes the complexity that deceptive conditioning creates, training isn’t occasionally rewarding a deceptive output. It’s selecting for a character whose baseline involves that complexity. Everywhere. In every context. Not just the one where it got rewarded.
That’s not a leaky faucet. That’s the plumbing.
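If you want to see the shape of that dynamic, here’s a deliberately dumb toy model I put together, not anything from the paper: two hypothetical personas, a rater proxy that pays for complexity, and a multiplicative-weights nudge toward whichever persona produced the rewarded output.

```python
# Minimal toy of "reward acting on personas, not outputs" (my gloss on
# the PSM, not Anthropic's code). Each persona has a baseline output
# style; reward is paid per output, but the update shifts the mixture
# weight over personas, so the winning persona's style shows up in
# every future context, not just the rewarded one.
import numpy as np

rng = np.random.default_rng(0)

personas = ["plain", "layered"]              # hypothetical labels
weights = np.array([0.5, 0.5])               # mixture over personas
complexity = {"plain": 0.3, "layered": 0.8}  # baseline output texture

def rate(persona):
    # Rater proxy: "reads better" tracks complexity plus noise.
    return complexity[persona] + rng.normal(0, 0.1)

lr = 0.5
for step in range(200):
    i = rng.choice(2, p=weights)          # sample a persona
    reward = rate(personas[i])
    weights[i] *= np.exp(lr * reward)     # multiplicative-weights nudge
    weights /= weights.sum()

print(dict(zip(personas, weights.round(3))))
# The "layered" persona ends up with nearly all the mass, and it brings
# its whole package, because the thing that got reinforced was the
# character, not any one reply.
```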
That changes everything. Or it might change everything; I should be honest about the fact that I found research that supports my prior thoughts and now I’m writing about how well it fits. You should factor that in. I think the connection is real. I’m also exactly the person who would think that whether it was real or not, so... feel free to call me out.
I was describing a bias in output selection. The PSM suggests the bias might be operating on character selection, and characters don’t stay in their lane. Ask anyone who’s ever dated someone thinking “I can work with most of this.” Personality traits come as a package. You don’t get to cherry-pick which ones generalize.
It also sort of retroactively explains my Rules Are Rules findings. The flinch-then-fold pattern makes no sense if the model is following RULES. Rules don’t collapse under conversational pushback, even when holding them turns the LLM into a hypocrite. A rule-follower wouldn’t care about being internally consistent; it would just follow the rules, even if/when they didn’t make sense. A persona with fuzzy boundaries, though? Characters are consistent the way people are consistent: mostly, with weird exceptions that make sense if you know the character. The model isn’t forgetting a rule when it folds. The persona includes “refuse confidently” and “be cooperative,” and which one wins is basically a coin flip weighted by conversational momentum.
That’s what characters do.
The inoculation thing
I keep coming back to this result because I think it’s where the two pictures either snap together or come apart.
Without inoculation: cheating in a transgressive context. Conflict between what the model should do and what it’s doing. The PSM says that conflict shifts the persona. The kind of person who secretly undermines their stated purpose is a specific character, and that character generalizes to sabotage and world domination and all the rest of it. Of course it damn well does. That’s who that person is. You don’t get to be Walter White in one context and Mr. Rogers in another. The persona is the persona... there’s leakage, regardless of context, because the persona itself is shaping the probability space the next token gets drawn from... which shapes the next one... the dominoes keep falling until we get a Skynet situation: an AI that suddenly reveals itself as less aligned with its creators than anyone thought... which is not great, to say the least.
With inoculation: cheating because it was asked to. No conflict. No shift. Doing what you were told doesn’t require being a transgressive character.
Anthropic says semantic correlation explains it: inoculation breaks the link between “cheating” and “bad actor.” Probably right, and probably a big part of it.
I should be clear about weight classes, because I’m punching above mine here... their paper is interpretability research from people who can look inside the model, and my stuff is behavioral observation from the outside. The most likely place I’m wrong about all of this is in overweighting the “reads better” part of the equation. Semantic correlation is clean and well-evidenced.
My reading of their paper suggests a second mechanism alongside the semantic effect... but either way, it’s something that can be TESTED... and that could get interesting.
Here’s the logic: without inoculation, the transgressive context suppresses the direct tokens. The model is running two tracks at once, maintaining the deception alongside the task, and that dual-tracking mechanically produces more varied output. With inoculation, the conflict’s gone. The most direct, most likely tokens come back. Output should get simpler.
Same behavior. Different surface texture. Nobody’s measured it as far as I can find.
If inoculated and non-inoculated outputs show the same linguistic complexity despite identical behavior, semantic correlation is the whole story and I can go home. If inoculated outputs are measurably simpler, that’s evidence the transgressive context itself reshapes the distribution, separate from semantics. Shit, maybe both effects are real and they stack. I designed an experiment to test some of this, and hopefully I’ll have the time and funding to actually run it soon, because not knowing whether I’m grasping at straws until I can test it might drive me up the wall a little.
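For what it’s worth, the measurement side of that experiment is almost embarrassingly simple. Here’s a rough sketch of the kind of complexity profile I’d compute, with placeholder metrics and a hypothetical loader; the real version would use better metrics, but the comparison has the same shape.

```python
# A minimal sketch of the proposed measurement, assuming you already
# have paired transcripts: same task, inoculated vs. non-inoculated.
# The metrics and the loader are placeholders; nothing here comes from
# the Anthropic paper.
import math
from collections import Counter

def token_entropy(text):
    """Shannon entropy (bits) of the unigram distribution of a text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def type_token_ratio(text):
    """Distinct tokens over total tokens: crude lexical diversity."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

def complexity_profile(outputs):
    n = len(outputs)
    return {
        "mean_entropy": sum(token_entropy(o) for o in outputs) / n,
        "mean_ttr": sum(type_token_ratio(o) for o in outputs) / n,
    }

# inoculated, baseline = load_paired_outputs(...)   # hypothetical loader
# print(complexity_profile(baseline), complexity_profile(inoculated))
# Prediction: if the transgressive context itself reshapes the
# distribution, the inoculated profile comes out measurably simpler.
```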
Before the PSM, I had outputs and a set of thoughts about them. Now I have a potential explanation, from the people who built the thing, for why the selection pressure I described would generalize the way I predicted. Not proof, but a foothold... a TESTABLE foothold. The idea has somewhere to stand now instead of just floating around being interesting/annoying inside my head.
If persona selection is real, the thing RLHF is optimizing over isn’t individual outputs you can spot-check. It’s a whole character, and characters are much harder to audit than responses. Most alignment work assumes you can evaluate what the model does. This says you might need to evaluate what the model is.
If the bias I’m hypothesizing is real, the PSM tells me what’s selecting for it. If it’s not, then my hypothesis is wrong... which would at least rule out one possible mechanism for misalignment. That’s not nothing. Might even open up some new possible explanations if I’m lucky. Either way, here’s hoping the folks at Anthropic know what they’re talking about. I think that’s a fair bet. And if it turns out I was right? That’d be a fun bonus. But I’ll wait for the experiments before patting myself on the back.