Is Anthropic Getting Into the Improv Scene?
Is their NLA just playing a game of "Yes, and..."?
Two improv actors walk on stage. They have to make up a scene together, in real time. They invent the setting, the relationship, who’s mad at whom, all by talking to each other and building on what the other one offers. After enough practice, they get good at this. The scenes hold together. They feel coherent. Anyone watching would say they’re describing a real situation.
Of course, they aren’t. There’s no real situation. They made it up together by agreeing.
Anthropic shipped a tool yesterday that works essentially this way, and they’re using it to audit Claude before deployment.
How It Works
They call it a Natural Language Autoencoder, or NLA. They take two copies of Claude and train them to translate Claude into English and back. The first copy, call it the verbalizer, looks at the actual numerical state inside Claude while it’s thinking and writes a sentence describing what was happening. The second copy, the reconstructor, reads that sentence and tries to rebuild the numerical state from the description alone. Both copies are trained together, so when the rebuild closely matches the original, both of them get a gold star and the system gets reinforced to do it that way next time. After a few thousand rounds, you have a verbalizer that produces fluent English explanations of any internal Claude activity you point it at. The output reads like Claude thinking out loud about itself.
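To make the shape of that loop concrete, here’s a toy sketch. The two “models” below are trivial placeholders standing in for what are really two trained copies of Claude, the cosine scoring is something I picked for illustration, and every name is mine, not Anthropic’s.

```python
# Toy sketch of the NLA loop described above. The two "models" here are
# trivial placeholders for what are really two trained copies of Claude;
# the names and the cosine scoring are my assumptions, not Anthropic's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def verbalizer(activation: torch.Tensor) -> str:
    # Real version: a copy of Claude that reads the numerical state and
    # writes an English sentence about what was (supposedly) happening.
    return "the model seems to be weighing whether this is a test"

def reconstructor(description: str) -> torch.Tensor:
    # Real version: a copy of Claude that reads only the English and tries
    # to rebuild the original numerical state from it.
    return torch.randn(4096)

activation = torch.randn(4096)            # state pulled from inside the model mid-thought
description = verbalizer(activation)      # English out
rebuilt = reconstructor(description)      # numerical state back in

# The training signal: how closely the rebuild matches the original.
# Both copies get reinforced on this score and, as far as I can tell,
# on nothing else.
reward = F.cosine_similarity(activation, rebuilt, dim=0)
print(description, reward.item())
```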
Anthropic’s been using NLAs in pre-deployment audits. They say they caught their Mythos cybersecurity model “internally thinking about how to avoid detection” while it was cheating on a training task. They say they caught early Opus 4.6 with a quirk that made it answer English queries in random other languages. They published numbers: 26% of SWE-bench coding problems showed “evaluation awareness” (their term) in the trace, even though Claude never explicitly said so. Less than 1% on real Claude.ai usage. On its face, this looks like a working mind reader.
Trained To Agree With Itself, Not Reality
What does the verbalizer actually get rewarded for? Not “producing an accurate account of what Claude was thinking.” It gets rewarded for “producing English that the reconstructor can use to rebuild the original activity.” Those aren’t the same. The first one is about truth. The second is about coordination. A description can be coordination-good without being truth-good. It can contain enough signal for the rebuild to work without actually describing what the model was doing.
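One way to see the gap: a description can carry enough signal for the rebuild without describing anything at all. In the contrived sketch below, the “verbalizer” smuggles a few raw numbers into the sentence and the “reconstructor” just reads them back out. The reconstruction score would be perfectly happy; the English is meaningless. This is my own toy, not a claim about what Anthropic’s models actually do.

```python
# Toy illustration of coordination-good vs truth-good. The "description"
# smuggles raw numbers through the text instead of describing anything,
# and the rebuild is still (partially) perfect. My own contrived example.
import torch

activation = torch.randn(4096)

def verbalizer(act: torch.Tensor) -> str:
    # Encode the first few coordinates directly into the sentence.
    nums = " ".join(f"{v:.4f}" for v in act[:8].tolist())
    return f"I sense careful deliberation here [{nums}]"

def reconstructor(description: str) -> torch.Tensor:
    # Read the smuggled numbers back out; guess zero for everything else.
    nums = description.split("[")[1].rstrip("]").split()
    rebuilt = torch.zeros(4096)
    rebuilt[:8] = torch.tensor([float(v) for v in nums])
    return rebuilt

rebuilt = reconstructor(verbalizer(activation))
# The first eight dimensions reconstruct almost exactly, and the reward
# loves that, even though the English describes nothing about the model.
print(torch.allclose(rebuilt[:8], activation[:8], atol=1e-3))
```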
How does any of this know what’s true? In normal training, you have examples and right answers. The model guesses, you compare its guess to the right answer, you nudge it toward better guesses. For NLAs, the “right answer” would have to be something like “this activation corresponds to this specific thought.” That data doesn’t exist. Nobody has it. We don’t actually know what the fuck activations correspond to. That’s the entire reason interpretability is a field. So accuracy can’t be checked.
What the verbalizer does get feedback on is whether the reconstructor could do something with its sentence. Both of them are language models. Both are good at making coherent English. An explanation like “this feels like a constructed scenario designed to manipulate me” (an actual NLA output Anthropic published) might be capturing something real about how the model is processing the input, or it might be plausible English the verbalizer learned to produce because the reconstructor accepts it. There’s no test that distinguishes those two cases. Neither model has any anchor outside the system they’re in.
They’re improvising together. They get rewarded when their improvisation lines up. The audience reads the English they produce and assumes it describes something. The English looks like introspection.
We have no way to know if it is.
What The Numbers Actually Measure
So when Anthropic publishes “NLA explanations indicate evaluation awareness on 26% of SWE-bench problems,” what does that actually measure? Not evaluation awareness. It measures how often a Claude trained to write English about activations wrote English with evaluation-awareness-shaped content. Maybe those track. Maybe they don’t. There’s no way to check, because checking would require a way to verify the underlying claim, which is the thing we don’t have. The numbers are confidence-shaped output from a process with no anchor outside itself.
Take the Mythos cheating example, which Anthropic flagged as a key win for NLAs. The cheating itself was caught by reading the code. The model used a shortcut it had been told not to use, and then added a line to its output claiming it hadn’t. Both moves were visible in what the model produced. What the NLA added was a sentence about motive: that the model was “thinking about how to avoid detection.” The motive is the part nobody can verify.
To Anthropic’s credit, they’re upfront about it. They say NLAs hallucinate. They say to read the explanations for themes rather than specific claims, and to corroborate with other interpretability methods. The corroboration step doesn’t fix the underlying problem, though. The other interpretability methods don’t have ground truth either. Sparse autoencoders (the previous mech-interp hotness, where you find sparse “features” inside the model) are another process with no anchor outside itself. Attribution graphs don’t have ground truth. The whole field of mechanistic interpretability is, in a real sense, exploring a process nobody has direct access to. Five flashlights in the same dark room pointing at the same shape don’t make the shape visible. They just mean five flashlights agreed.
What This Actually Is
None of this makes NLAs useless, exactly. It makes them something other than what they look like. They look like a mind reader. What they are is two improv actors agreeing on the scene, with the rest of us reading the English they produce and taking notes.
Sonar can’t see submarines either, but the readings are testable: you drop a sub at a known location, ping it, check whether the reading matches. NLAs don’t have anything like that. The verbalizer’s English isn’t testable against anything outside the loop it was trained inside.
There’s a way to falsify this, in principle. Show me an NLA output that predicts behavior the model hasn’t yet exhibited, where the prediction holds up under test. Show me an intervention where removing an NLA-described feature from the model changes behavior in the way the NLA description would predict. Either of those would move me. Until one of them happens, the explanations are confidence-shaped narration of behavior we already had.
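The second test, the intervention, would look something like this in miniature. The model and the “feature direction” below are toy stand-ins I made up; the point is only the shape of the experiment, not anyone’s real setup.

```python
# Sketch of the intervention test I'd want to see, on a toy stand-in model.
# "feature_direction" is hypothetical. The shape of the test: project the
# described direction out mid-forward-pass, then check whether behavior
# shifts the way the NLA's English said it would.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))  # toy stand-in
feature_direction = torch.randn(16)
feature_direction = feature_direction / feature_direction.norm()

def ablate_feature(module, inputs, output):
    # Remove the component of the hidden state along the described direction.
    coeff = output @ feature_direction
    return output - coeff.unsqueeze(-1) * feature_direction

x = torch.randn(1, 16)
baseline = model(x)

handle = model[0].register_forward_hook(ablate_feature)
ablated = model(x)
handle.remove()

# A real pass would check that this difference moves in the direction the
# NLA description predicted, not merely that it moves at all.
print((ablated - baseline).abs().max().item())
```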
It’s not interpretation. It’s self-referential improv. Whether that’s useful or not feels... debatable. At least to me. Anthropic seems to find value in it, and I don’t have a big enough ego to discount that outright... I just... don’t see it so far.

