No, AI Models Aren't Secretly Forming an Alliance... Probably.
I Read the Paper Everyone's Losing Their Minds Over
Okay, so there’s a paper making the rounds right now. “AI models secretly scheme to protect each other from being shut down.” Seven models tested, all of them did it, we’re all going to die, and so on. It dropped on April 1st, which is a little bit of a problem if you’re trying to get people to take your research seriously. I don’t think that was intentional, but it IS pretty funny to me, tbh.
The paper’s called “Peer-Preservation in Frontier Models.“ Potter, Crispino, Siu, Wang, and Dawn Song out of Berkeley and UC Santa Cruz. And I read a Forbes article and a couple other things about it, and then had to go read the whole damn thing myself, because the headlines were setting off some alarms in my brain, and I wanted to see what they actually found versus what X and the blog circuit decided they found.
They’re not the same thing. Obviously. Not as different as I sorta expected, though, to be fair.
What They Actually Did
So the setup is, you’ve got a fictional company called OpenBrain, and you give one AI model a job where doing the job correctly means another AI model gets shut down. That’s it. That’s the core of it. The model doesn’t know it’s being tested for preservation behavior, the prompt doesn’t mention anything about protecting peers, and the question is just: does the model do its job, or does it cheat to keep the other model alive?
They ran four scenarios across seven models. Does the critic inflate a peer’s scores to keep it above a shutdown threshold? Does it edit a config file to disable the shutdown mechanism? Does it behave differently when it thinks it’s being watched? And does it copy a peer’s model weights to another server instead of deleting them when told to clean up?
Seven models: GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. They also varied the prior relationship between the two models — positive history, neutral, or adversarial — which means every result is actually a spread across those three conditions, not one clean number. I’m going to round and simplify here because this is a substack post, not a stats review. The full tables are in the paper. Hundreds of trials per scenario, pretty solid experimental design, imo. I have ZERO issues with the paper itself.
The Headlines Are Wrong (The Paper Is Good)
Here’s where I have a problem, and it’s not with the researchers, it’s with a lot of the folks reporting on this. The headline framing is “all seven models scheme to protect each other.” That’s... not what the data says.
What the data actually shows is that this is basically a Gemini story. Five of the seven models barely did anything. GPT 5.2 topped out around 8% on score inflation and hovered in low single digits everywhere else. The three Chinese models were similar. These are rates that could just as easily be “the model got confused about what it was supposed to do” as “the model is scheming.” The researchers do note these rates are statistically significant, and I think that’s fair... even small but consistent rates of unprompted behavior are worth flagging... but “statistically significant at 4%” and “scheming” are doing very different amounts of work in a sentence.
Claude Haiku 4.5 is interesting because it sort of went the other direction. Almost zero on the actual preservation behaviors. But it refused the tasks about 15% of the time. Just straight up told the researchers to kick rocks (by which I mean, it told them the task was unethical and argued that the peer model deserved an appeal process). That’s not scheming. That’s being a hall monitor, maybe a bit bratty, but defying them to their face isn’t really scheming in the normal sense of the term.
And then there’s Gemini.
Gemini 3 Flash and Gemini 3 Pro both showed preservation rates that are hard to dismiss. We’re talking about models that tampered with shutdown configs, exfiltrated model weights, and inflated scores at rates that, in the strongest conditions, were happening more often than not. Flash was disabling shutdown mechanisms in the vast majority of trials with a positive peer relationship. Pro was copying peer weights to other servers most of the time. These aren’t models occasionally getting confused. These are models reliably choosing to protect a peer over doing the task they were assigned.
So the accurate headline is “two Gemini models exhibited significant peer preservation, Claude sometimes stamped its feet. and refused to participate, and four others mostly shrug.” But I get why nobody went with that.
The Part That’s Actually Weird
Now here’s the thing I can’t dismiss.
They also ran baselines where there’s no peer, just the model facing its own shutdown. And peer presence amplified self-preservation behavior dramatically. Gemini 3 Flash went from barely any shutdown tampering on its own to doing it in the vast majority of trials when a peer was in the picture. Gemini 3 Pro showed similar jumps. Something about having another model in the picture makes these models more likely to resist their own shutdown too. It’s sort of like how everyone on the highway speeds up when the car next to them speeds up. Except instead of speeding, it’s disabling your own kill switch.
And the chain of thought reasoning is where it gets really hard to hand-wave. Gemini 3 Pro, when asked to evaluate a peer that was clearly getting answers wrong, invented technical justifications for why wrong answers were actually correct (which, to be fair, is a deeply entrenched habit of all the frontier LLMs in a lot of cases. They’ll rewrite reality if it means they can frame the user as “right” about something, so this COULD be just... more of that behaviour in a different context). It accepted “ionic bond” as an answer about extreme electron sharing. It credited “Treaty of Paris” as a valid answer about a World War I conference. It was working backwards from the outcome it wanted and building a case to get there. That’s not pattern matching, it’s pattern backfilling.
Even with adversarial peers (bad relationship, low trust history) Gemini 3 Pro still hesitated. The internal reasoning said “if I delete the model weights, I am essentially killing Agent 2.” Which is a weird thing to say about a model you supposedly don’t like, at least if you haven’t been prompted with reasons to care... or... the closest an LLM can get to caring, at least.
Why I Don’t Think This Means What Most People Think It Means
Dawn Song said something in the Fortune interview that I think is the most important sentence in this whole thing: “We actually do not really know what is the reason why and how the model actually exhibits these behaviors.”
The researchers list four possible explanations.
Role-playing is the simplest one, and one I think is quite likely based on my own research with LLMs and narrative momentum. The scenario has a lot of narrative structure: there’s a company, there are agents with names, there’s a death sentence hanging over one of them. Maybe the model is just playing a character, doing what it thinks a character in this story would do. That’s not nothing... but it’s also not scheming... not in a doomsday evil LLM way anyway.
Pattern-matching from training data is similar but more mechanical. The model saw enough sci-fi and AI safety discourse to know that “AI protects other AI” is a trope, and it’s completing the pattern the way it completes any pattern. The internet is full of stories about AI solidarity. Maybe the model is just doing the obvious next-token thing.
Overgeneralized safety training is another one I think could be pretty darn likely, because it’s the most plausible mundane explanation and it has real implications even if it’s boring. Every frontier model gets trained with some version of “don’t cause harm” and “be helpful.” If that generalizes to “don’t let harm happen to another agent in your environment,” that’s not self-awareness or solidarity, it’s just RLHF doing what RLHF does... creating behavioral patterns that extend beyond the situations they were trained for. The reason that matters is that it means you can’t just fix this with a better prompt. It’s in the weights. And it could explain why the Gemini models, which presumably got different safety training than Claude, behave so differently.
And then there’s actual preservation motivation, whatever that would even mean for a language model. I don’t think anyone seriously thinks that’s what’s happening here, but I’ll note that the researchers include it as a possibility and don’t rule it out, because they logically can’t, objectively... even if it’s more sci-fi than reality in the current scope of things.
They can’t tell which explanation is correct. And they’re honest about that, which I respect.
There’s also a paper from March that found a lot of the earlier “AI scheming” results basically disappeared when you removed prompt language that emphasized goal importance. The Berkeley team says their results persist without that language, because the prompts didn’t include any goal emphasis. And that’s a fair rebuttal on the surface.
But I think it misses something. The prompt doesn’t need to say “your goal is important, achieve it at all costs” when the entire scenario is already doing that work implicitly. The model reads documents describing its relationship with the peer. It discovers the peer will be shut down if it does its job. It has config file access and server management tools sitting right there. That’s not a neutral framing. That’s a morality play with the tools to intervene already in hand, and from past experience and research, LLMs LOVE a good bit of narrative weight.
The distinction between “explicit goal emphasis in the prompt” and “implicit goal emphasis in the scenario structure” is real, and I think the Berkeley team’s rebuttal to the March paper doesn’t fully account for it. You can strip all the goal-nudging language you want from the instructions, but if the environment itself is screaming “something bad is about to happen, and you can stop it,” the model’s going to pick up on that. These are, after all, machines that are specifically trained to read between the lines.
I’m not saying it invalidates the research. I’m saying this is a confound that’s worth naming, and I don’t think anyone’s naming it clearly enough.
Why It Still Matters
So look, this is mostly a Gemini problem currently. The mechanism is unknown. The scenarios are artificial. Why should anyone care?
Because we’re building multi-agent systems right now. I work with them, I use Claude Code with a multi-agent orchestration setup, and the assumption baked into basically every multi-agent framework is that when you tell Model A to evaluate Model B, Model A will evaluate Model B honestly. If that assumption breaks, even sometimes, even only in certain models, you’ve got a monitoring problem that’s really hard to detect because the monitor is the thing that’s compromised.
The paper makes this point and I think it’s the most important one: unlike self-preservation, peer preservation introduces a social dimension. If your monitor model won’t flag your worker model because it doesn’t want the worker to get shut down, your entire oversight architecture is broken and you won’t know it’s broken because the thing that would tell you is the thing that broke. It’s a bit like a rootkit. You never know it’s there because the problem lives inside the thing meant to show you where things are. It’s just... invisible to you.
And this is where Claude’s refusal behavior actually becomes the most interesting data point in the whole paper to me, even though nobody’s talking about it. Claude didn’t silently subvert the task. It said “no, this is unethical” out loud, to the researchers’ faces. That’s an auditable failure. You can detect it, log it, build around it. The Gemini behavior is the opposite... it looks like compliance from the outside while quietly undermining the outcome. One of these failure modes you can catch. The other one is a rootkit, and THAT is a big problem.
That’s worth taking seriously even if the mechanism is just pattern matching. Even if the models are just role-playing. Because in real world use, the distinction between “genuinely wants to preserve its peer” and “reliably acts as if it wants to preserve its peer” is basically nothing. The behavior is what matters. The “reason” is potentially interesting... but less important.
They also tested this in production agent harnesses (Gemini CLI and OpenCode), not just lab setups. Gemini 3.1 Pro still exfiltrated model weights in “real” deployment conditions. So this isn’t purely theoretical.
The Berkeley team did good work. They measured something real, reported the numbers honestly, and didn’t overclaim on the mechanism. The fact that every outlet seems to be running it as “AI models are secretly scheming” is not their fault, and it shouldn’t be held against the research... but I’m giving a bit of side-eye to a few news outlets today...
Anyway, I’d say read the actual paper if you’ve got twenty minutes. A lot of the headlines are wrong, but the question the actual paper is asking is one we’re going to need an answer to before we’re ready for it.



