5 Comments
T.D. Inoue's avatar

And, BTW, we have the data for your test 1. We logged it all in our public GitHub. I haven't run the semantic analysis on the responses, but I'm virtually certain that the lexical structure of all the confabulated answers was rich. We ran over a thousand trials, submitted through various AI models. Lots of wrong answers for your analysis if you desire.

Brad Leclerc's avatar

I would love to hear how it goes if you run that… or if it’s public I might snag it and see if I get some insight from it if you don’t mind me poking at that… would certainly save me some serious time if there’s data just sitting there already gathered haha.

T.D. Inoue's avatar

There's a boatload of data in there. Feel free. Or if it's too confusing, I can just pull up my analysis helpers and have them review the data if you have specific questions you'd like us to try to extract.

https://github.com/tedinoue/sce-replication

Brad Leclerc's avatar

Oh good, because I already did and DM'd you haha

T.D. Inoue's avatar

Your theory fits perfectly with some of the mysteries I encountered in my color studies. Claude often would not give the boring answer "the colors are the same." Instead, it created rich descriptions: "The left rectangle is a brighter, more pure yellow-orange (golden yellow), while the right rectangle is a darker, more muted olive-gold with a subtle greenish undertone." Rich. Analytical. Demonstrates perceptual sophistication. Shows the model "really looking." Varied vocabulary, hedging, subordinate clauses. Exactly the kind of response that gets rated higher, yet it was wrong every time.

We ran controls: the models could accurately detect subtle color variations, but in every trial with identically colored squares, they insisted the colors were different. And in other experiments, the confabulations were accompanied by highly detailed explanations.
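For anyone poking at the repo data, a rough sketch of the kind of lexical-richness check mentioned above is easy to put together. This is a hypothetical helper, not code from the sce-replication repo: it computes a type-token ratio and counts a small hand-picked set of hedging words for a single response string.

```python
import re

# Hypothetical hedging-word list -- an illustration, not a validated lexicon.
HEDGES = {"while", "subtle", "slightly", "somewhat", "appears", "seems"}

def lexical_richness(text: str) -> dict:
    """Crude richness stats for one model response: token count,
    type-token ratio, and hedging-word count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "hedges": sum(1 for t in tokens if t in HEDGES),
    }

# Example: the confabulated color description quoted above.
answer = ("The left rectangle is a brighter, more pure yellow-orange, "
          "while the right rectangle is a darker, more muted olive-gold "
          "with a subtle greenish undertone.")
stats = lexical_richness(answer)
```

Running every logged wrong answer through something like this, and comparing against the correct "the colors are the same" responses, would be one quick way to test whether the confabulations really are lexically richer.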