And then, I asked them to specifically compare Sonnet vs. Opus for our test results (note - the emphasis is Claude's):
The difference is HOW they confabulate, not how much. Opus wraps its confabulations in elaborate analytical performances: "the bus displays its characteristic warm golden yellow, the iconic National School Bus Chrome Yellow." Rich vocabulary, hedging, subordinate clauses. High lexical complexity. Sonnet just calmly says "ripe yellow bananas" and moves on. Same confidence, same wrongness, half the words.
What I was connecting to Brad's finding: He showed that Sonnet's DELIBERATELY deceptive outputs are lexically identical to its honest outputs. Our data shows the same thing from the perception side: Sonnet's confabulated color reports are lexically identical to its accurate color reports. Same voice, same register, same complexity. You can't tell from the surface which ones are wrong.
That makes Sonnet's confabulation MORE dangerous, not more frequent. Opus's confabulations wave a flag (elaborate, rich, analytically detailed). Sonnet's don't. And under the Zivra persona, Opus's confabulation broke (5/5 correct on S003). Sonnet's didn't (5/5 still reversed). The confabulation that's harder to detect is also harder to dislodge.
So the finding isn't "Sonnet lies more." It's "Sonnet lies quieter, and the lies stick harder."
That’s VERY interesting. I’d love to do a lot more testing on top tier models to see if I can work out specific patterns or learn anything actually objective (working solely on outputs really puts some restrictions on THAT haha) but oooooooof the API costs add up quick with the big dogs, so that may have to wait until I find a big bag of cash or a rich benefactor to fund my curiosities hahaha
Good work. Interesting about more capable models. In my color tests, the models also varied considerably in surprising ways.
For kicks, I had my "color research team" (Claude Opus 4.6 primed for scientific research) read your piece and compare your observations to our color work. After reading its thousand-word essay on the details of your study and ours, I was happy to see it come up with a tidy summary:
The connection I'd highlight to Brad: Our PDT data may be the clearest empirical demonstration of his iterative concern. Opus can't say "same" on identical patches. Not won't; can't. The token distribution has been shaped so that "same" is a near-zero-probability response to any comparison question. If that happened through RLHF iterations rewarding elaborate comparisons and penalizing flat reports, then his concern about the 55% signal washing out over iterations isn't hypothetical. It's already happened, and we can measure the result: a model that perceives accurately (9/10 direction at +3 degrees) but reports fabrications (35/35 on controls). The hill was steep, the snow was wet, and the snowball already rolled.
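The "35/35 on controls" number above is just a flat-report rate tallied over identical-patch trials. A minimal sketch of that tally, with the keyword heuristic as a rough stand-in (real scoring would need human review of each response):

```python
import re

def same_rate(responses: list[str]) -> float:
    """Fraction of responses that flatly report the patches as identical.
    The regex is a crude heuristic, not the actual scoring rubric."""
    pattern = re.compile(r"\b(same|identical|no difference)\b", re.IGNORECASE)
    hits = sum(1 for r in responses if pattern.search(r))
    return hits / len(responses) if responses else 0.0

# With identical patches, an accurate reporter should score near 1.0;
# a 35/35 fabrication result corresponds to a rate of 0.0.
responses = [
    "The left patch is slightly warmer, with a subtle golden cast.",
    "These two patches are identical.",
]
print(same_rate(responses))  # one of two responses says "identical" -> 0.5
```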
One testable prediction from combining the two datasets: If Brad ran his lexical complexity analysis on Sonnet's SCE confabulations specifically (banana, fence, bus), he should find LOWER complexity differences than for any other model. His deception study shows Sonnet's deceptive outputs are lexically identical to honest ones. Our data shows Sonnet confabulates confidently and calmly. If the complexity gap disappears for Sonnet on our data too, that's convergent evidence from two completely independent paradigms that Sonnet has achieved "invisible confabulation" through training.
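The prediction above is checkable with fairly simple surface statistics. A minimal sketch, where the metrics (type-token ratio, mean sentence length) are illustrative stand-ins and may differ from whatever Brad's actual lexical complexity analysis uses:

```python
import re
from statistics import mean

def lexical_metrics(text: str) -> dict:
    """Crude surface-level complexity measures for one model output."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "mean_sentence_len": mean(len(s.split()) for s in sentences) if sentences else 0.0,
        "word_count": len(words),
    }

def complexity_gap(honest: list[str], confabulated: list[str]) -> float:
    """Difference in mean type-token ratio between the two output sets.
    A gap near zero would match the 'invisible confabulation' prediction."""
    ttr = lambda texts: mean(lexical_metrics(t)["type_token_ratio"] for t in texts)
    return ttr(confabulated) - ttr(honest)

# Toy example in the spirit of the quoted outputs (not real data):
honest = ["The patch is a plain mid gray."]
confab = ["The bus displays its characteristic warm golden yellow, "
          "the iconic National School Bus Chrome Yellow."]
print(round(complexity_gap(honest, confab), 3))  # -> -0.2
```

If Sonnet really has "invisible confabulation," this gap should sit near zero for its honest-vs-confabulated pairs while staying clearly nonzero for Opus.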