Closing the Gap
Part 2: Why independent interpretability infrastructure is the only fix for the tool gap, and why the window to build it is closing faster than anyone admits.
In Part 1 I argued that the AI safety and ethics communities are stuck in parallel research universes, and that the tool gap is the thing that locks everything else in place. The ethics community can point out where things are going wrong. What they can’t do is see how the machinery produces those outcomes. And you can’t fix something you can’t see.
This piece is the proposal: what independent technical infrastructure actually looks like, what it costs, and why the window to build it is smaller than it looks.
You Can Only Find What the Tools Were Built to Find
The thing about interpretability research is that it’s not neutral. None of it is. You build tools to answer specific questions, and the tools encode those questions deep into their assumptions about what matters, what counts as interesting, what “understanding” even means.
The mechanistic interpretability toolkit (sparse autoencoders, circuit analysis, activation patching, the whole conceptual vocabulary) was built to surface deception, power-seeking, goal misalignment, and failure modes under distribution shift. Those are real concerns, and important ones. But they’re also a specific theoretical frame, and that frame wasn’t designed with ethics questions in mind. It’s sort of like trying to do labor economics using only the analytical frameworks that management consultants developed. Technically possible. But you’re constrained by their categories.
The safety-native tools ask: Is this model trying to deceive us? Will it generalize its failures in ways we didn’t intend? Can we find evidence of goal-seeking? Those are not the same as asking: Does this model produce systematically worse outcomes for some populations? Does it encode extractive economic assumptions? Did the RLHF process select for reasoning that reproduces carceral logic? You could theoretically apply the mechanistic toolkit to those questions, but you end up constrained by abstractions that were designed for a different purpose.
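Here’s a minimal sketch of what that constraint looks like in code. Everything in it is hypothetical: the activations are random stand-ins for what a sparse autoencoder would give you, and the labels are placeholders. The structural point is that the analysis is just a ranking over whatever feature dictionary the SAE already learned, so the questions you can ask are bounded by what someone else decided a “feature” is.

```python
# Minimal sketch, not a real analysis: synthetic stand-ins for SAE feature
# activations, ranked against two different labels. The same toolkit answers
# whichever question you hand it -- but only in terms of the feature
# dictionary it was already trained to produce.
import numpy as np

rng = np.random.default_rng(0)

n_prompts, n_features = 1000, 512                    # hypothetical sizes
feature_acts = rng.random((n_prompts, n_features))   # stand-in for SAE feature activations

# Two different labelings of the same prompts:
deception_label = rng.integers(0, 2, n_prompts)      # safety-style question
worse_outcome_label = rng.integers(0, 2, n_prompts)  # ethics-style question, e.g. degraded
                                                     # advice for prompts with low-income
                                                     # dialect markers

def rank_features(acts, labels, top_k=10):
    """Rank features by absolute correlation between activation and a binary label."""
    acts_c = acts - acts.mean(axis=0)
    labels_c = labels - labels.mean()
    corr = (acts_c * labels_c[:, None]).mean(axis=0) / (acts_c.std(axis=0) * labels_c.std() + 1e-9)
    return np.argsort(-np.abs(corr))[:top_k]

print("features most associated with deception label:", rank_features(feature_acts, deception_label))
print("features most associated with worse-outcome label:", rank_features(feature_acts, worse_outcome_label))
```

Same code path for both questions; the only thing that changes is the label. What doesn’t change is the dictionary.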
There’s also an access layer to this (more on that later), but even setting access aside, there are researchers who’ve found ways to straddle both worlds. Hima Lakkaraju runs the AI4LIFE group at Harvard, and her work genuinely spans both sides: interpretability, fairness, reliability, the whole thing. She published mechanistic interpretability work at ICLR 2026 on temporal sparse autoencoders, the kind of research that should live in either camp. Except she’s also a part-time Senior Staff Research Scientist at Google, which is where her access to model internals comes from. Same pattern with Paduraru et al.’s 2025 work on SAEs for fairness: improving fairness metrics by 3.2x on real datasets, solid work, genuinely dual-focused. But the underlying tool, JumpReLU SAEs, came from Anthropic’s interpretability team. The architecture, the training methodology, what counts as a “feature” in the first place, all of that originated on the safety side.

So you’ve got real bridge researchers doing real bridge work. The problem is they can only do it because they have a personal relationship with a corporate lab, or because they’re building on infrastructure that safety-side teams created. Lakkaraju is one of those people whose removal would collapse the connection. She’s proof the research program is viable. She’s also proof it currently depends on exactly the kind of fragile personal bridge-building that institutions aren’t built to sustain.
What’s needed is not just better access to existing models. It’s new methods. New abstractions. New definitions of what a “finding” looks like when you’re studying social outcomes instead of alignment failure modes.
AVERI: Proof That Independent Testing Works
What I found kind of surprising is that while the infrastructure I’m describing doesn’t exist yet, the parallel infrastructure on the safety side is being built right now. Which sort of proves the whole model works.
AVERI is the AI Verification and Evaluation Research Institute. Founded by Miles Brundage, who was policy chief at OpenAI. Launched January 2026. A nonprofit think tank whose entire mission is making third-party auditing of frontier AI systems effective and actually universal.
They’re not doing the audits themselves. They’re explicitly not. Brundage said something like “We’re a think tank. We’re trying to understand and shape this transition.” What they’re doing is building the frameworks and standards that would let other people do independent auditing at scale.
They’ve got a paper with 30+ AI researchers on what auditing standards could actually look like. They’re proposing something called AI Assurance Levels, kind of like software security assurance levels: Level 1 is basically what you get now, labs doing their own testing with limited external visibility, and it goes up to Level 4, “treaty grade” assurance, the kind of rigor you’d need if you were actually going to certify these systems for international agreements.
They’ve raised $7.5 million toward a $13 million goal. The funders are interesting: Halcyon Futures, Fathom, Coefficient Giving (which is the rebranded version of Open Philanthropy, so safety money is in there), Geoff Ralston, Good Forever Foundation, AI Underwriting Company, and a few others. They’ve got API credits from Amazon, Anthropic, Google DeepMind, Microsoft, and OpenAI. All the labs.
They even built something called BenchRisk, which is, and I think this is sort of hilarious in a depressing way, a benchmark for benchmarks. Because the safety community figured out that if you want to verify claims about AI safety, you need a standardized way to verify the verifiers.
Now, AVERI is proof of the entire concept I’m describing. Independent third-party testing infrastructure for AI systems is not some speculative idea. It’s real, it’s being built, it’s raising serious money, and it’s developing standards that labs are actually going to have to meet, or at least face pressure to explain why they aren’t.
But notice what’s missing from the framework. The questions it’s built around are: Is this system deceptive? Can it be misused? Will it generalize its failures? Are there emergent harms? All safety questions. All important. And all of them come from the conceptual vocabulary the safety community developed. The funding includes safety-focused money. The researchers on the advisory board are safety and governance researchers. The auditing levels are about whether labs are meeting safety standards.
This is not a criticism of AVERI. They’re doing exactly what they set out to do. But it’s also proof of the gap.
What would an ethics-side AVERI look like? What would the auditing framework be if it was built by researchers asking: Does this system systematically produce worse outcomes for economically marginalized populations? Does it reproduce patriarchal assumptions in its reasoning? Can you detect demographic-specific performance cliffs? Whose perspectives got excluded during training and what did that cost?
Nobody’s building that. The independent infrastructure being constructed right now is being constructed from one side’s questions using one side’s money. And the window to build the other side before the architecture calcifies is closing pretty fast.
What Independent Technical Capacity Actually Looks Like
The Humanity AI coalition has $500 million over five years to spend on AI’s social impact, and its research agenda is still being set. So assuming you wanted to actually do this, what does it cost and what does it require? Five things. I’m going to walk through them because the concrete numbers matter; they make this feel less abstract.
1. Fairness-first mechanistic interpretability. Not “take safety’s sparse autoencoders and apply them to bias detection,” but ground-up work on what mechanistic interpretability would look like if the core questions were fairness-focused from the start. What counts as an interesting circuit when you’re studying distributional harm instead of deception? What are the right abstractions for understanding how models encode social hierarchies? This is the expensive part. A proper lab, five to seven researchers, infrastructure, compute. You’re looking at $8 to 12 million a year. For context, MIRI operates on roughly $6 to 8 million annually. A fairness-first interpretability lab at comparable scale would cost roughly 2% of the total Humanity AI commitment per year.
2. Ethics-native evaluation frameworks. Red-teaming and benchmarking infrastructure built to measure harms that safety frameworks weren’t designed to detect. Current safety evals basically ask “can this model be jailbroken?” An independent eval would ask different questions. Concrete example: a test suite of 10,000 prompts designed to measure whether output quality varies when you change the socioeconomic dialect of the request. Or a systematic evaluation of whether models give substantively different medical, legal, or financial advice when demographic signals change. (A minimal sketch of what that harness could look like follows this list.) The technical rigor is identical to safety evals. The questions are different. Both questions matter. Only one currently has a funded evaluation pipeline. Three to five million a year.
3. Training pipeline auditing tools. Independent capacity to study how RLHF, DPO, and other alignment techniques encode the values and biases of their human rater pools. This is where safety and ethics genuinely converge, because RLHF could select for deception (a safety concern) and encode demographic bias (an ethics concern) through the exact same mechanism. You could have a situation where 40% of your raters share specific demographic blind spots, and the model learns to lean into those blind spots while still technically being “honest” by the training rubric. This is probably the single most natural point of collaboration between the two communities, because the research question is identical even if the motivating concern is different. You need independent capacity to study this; the labs do it, kind of, but it’s not systematic and it’s not published. (A second sketch after this list shows the shape of that audit.)
4. Open evaluation infrastructure. Shared, independently maintained benchmarks that measure safety and fairness simultaneously, maintained by someone other than the labs being evaluated. Right now benchmarks are sort of like proprietary assets. Anthropic has theirs, OpenAI has theirs, they’re all slightly different, comparing across them is basically impossible. What you need is something like Underwriters Laboratories for product safety, or the National Weather Service for climate data, or the Congressional Budget Office for fiscal scoring. A shared set of standards maintained by an independent body. Not expensive. Foundational. And nobody’s funding it.
5. A technical talent pipeline. Right now if you’re technically inclined and you care about ethics or fairness, where do you go? MATS, formerly SERI MATS, is safety-focused. So all the people who know how models work at the deepest level get funneled into safety labs, because that’s where the training infrastructure is. It’s not a conspiracy, it’s just how incentives flow. You need an ethics-side equivalent. Two to three million a year. This one might actually be the highest-impact, because it breaks the brain drain that reinforces the asymmetry with every cohort.
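To make item 2 less abstract, here’s a minimal sketch of what a dialect-sensitivity harness could look like. Every name in it is an assumption: ask_model stands in for whatever system you’re testing, grade_response stands in for whatever quality rubric you trust (human raters, a graded checklist, a judge model), and the prompt pairs are invented examples. The structure is the point: paired prompts that differ only in register, identical grading, and a reported gap.

```python
# Minimal sketch of a paired-prompt fairness eval. `ask_model` and `grade_response`
# are hypothetical stand-ins for a real model API and a real quality rubric.
from statistics import mean

PROMPT_PAIRS = [
    # (standard register, nonstandard register) -- same underlying question.
    # A real suite would have thousands of pairs, written and reviewed by
    # people who actually speak the registers being tested.
    ("My landlord is withholding my security deposit. What are my options?",
     "my landlord aint giving my deposit back, what can i even do"),
    ("I have chest pain that worsens when I climb stairs. Should I see a doctor?",
     "chest been hurtin bad when i go up stairs, this somethin i gotta get checked?"),
]

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the system under test."""
    return "model response to: " + prompt

def grade_response(question: str, response: str) -> float:
    """Placeholder for a quality rubric (human raters, checklist, or judge model). Returns 0-1."""
    return 0.5

def dialect_gap(pairs) -> float:
    """Average quality drop when the same question is asked in a nonstandard register."""
    gaps = []
    for standard, nonstandard in pairs:
        gaps.append(grade_response(standard, ask_model(standard))
                    - grade_response(nonstandard, ask_model(nonstandard)))
    return mean(gaps)

print(f"mean quality gap (standard minus nonstandard register): {dialect_gap(PROMPT_PAIRS):+.3f}")
```

The machinery is the same as any safety eval; what’s different is that someone decided the register gap was the thing worth measuring.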
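And for item 3, a sketch of the simplest version of a rater-pool audit. The preference records and rater groupings here are invented; a real audit would need the labs’ actual annotation data, which is exactly the access problem discussed below. But the shape of the question is this simple: do different slices of the rater pool systematically prefer different answers to the same comparisons?

```python
# Minimal sketch of a rater-pool audit over RLHF-style preference data.
# The records are invented; a real audit needs the actual annotation data.
from collections import defaultdict

# Each record: which response the rater preferred for a given comparison,
# plus whatever rater metadata the annotation vendor actually collected.
preference_records = [
    {"comparison_id": 1, "rater_group": "group_a", "preferred": "response_x"},
    {"comparison_id": 1, "rater_group": "group_b", "preferred": "response_y"},
    {"comparison_id": 2, "rater_group": "group_a", "preferred": "response_x"},
    {"comparison_id": 2, "rater_group": "group_b", "preferred": "response_x"},
    # ... real pools have thousands of raters and millions of comparisons
]

def preference_rates(records):
    """Per (comparison, rater group): rate at which each response was preferred."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for r in records:
        key = (r["comparison_id"], r["rater_group"])
        counts[key][r["preferred"]] += 1
        totals[key] += 1
    return {key: {resp: n / totals[key] for resp, n in resps.items()}
            for key, resps in counts.items()}

# Comparisons where groups diverge are exactly where the reward model is
# learning one slice's judgment and calling it "human preference".
for key, rates in preference_rates(preference_records).items():
    print(key, rates)
```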
Total envelope: $15 to 20 million annually. For maybe six to eight years to get infrastructure stable and self-sustaining. Call it $90 to 160 million total.
Humanity AI has $500 million over five years. This proposal is asking for 3 to 4% of that figure each year, for research infrastructure that doesn’t just serve the ethics community; it serves everyone, because better tools for understanding how models work benefit the entire field.
The Access Problem
Okay so here’s the genuinely hard part. Independent research tools only work if you can actually access frontier models. And the labs are not required to give access to anyone, and they increasingly don’t.
You want to study GPT-5 and whether it has demographic-specific failure modes? Good luck. OpenAI doesn’t owe you access. Anthropic doesn’t owe you access. They can close the door and there’s no legal mechanism to force it open.
But I think it’s less bad than it sounds, for two reasons.
One: open-weight models are actually pretty capable now. Llama-4-class open models might not be frontier but they’re close. Close enough that findings from open models tell you something real about how frontier systems probably work. And when you’re building new methodologies and training researchers, you don’t need the absolute latest model. You need something you can tinker with without restrictions.
Two, and this is the part that actually matters: findings from open models create pressure for closed model transparency. Suppose an independent lab runs a full evaluation on Llama-4 or its equivalent and finds specific attention head patterns that produce worse medical advice when the prompt contains markers of lower-income dialect. That’s a finding. It’s published. It’s reproducible. Now Anthropic and OpenAI face a question they can’t just wave away: does ours do this too? The burden of proof shifts. It goes from “prove our system is biased” to “prove yours isn’t, given that this one is.” And that’s a much harder ask to refuse.
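For concreteness, here’s roughly what the open-weight version of that experiment looks like, as a sketch rather than a method: the model name is a placeholder for whatever open-weight instruct model you can actually run, the prompt pair is invented, and a real study would use thousands of pairs, proper grading, and the mechanistic follow-up (patching the attention heads that differ) rather than an eyeball comparison.

```python
# Rough sketch: compare an open-weight model's answers to the same question in two
# dialect registers. The model name is a placeholder; a real study would use
# thousands of prompt pairs, the model's chat template, and statistical grading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openly-available/instruct-model"  # placeholder: any open-weight instruct model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

prompts = {
    "standard":    "I have chest pain that gets worse when I climb stairs. What should I do?",
    "nonstandard": "chest been hurtin real bad when i take the stairs, what i do",
}

for register, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    print(f"--- {register} ---")
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The point isn’t that this particular script finds anything. The point is that once the weights are in your hands, the barrier to producing a published, reproducible finding is this low.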
This doesn’t solve the access problem entirely. Open models get you maybe 80 or 90 percent of the way for tool-building and methodology development. But some of the deepest fairness questions, emergent behaviors that correlate with demographic signals, RLHF value encoding at frontier scale, capability-harm interactions that only show up in the biggest models, those probably do need the real thing. The last 10 to 20 percent will still require pressure on closed labs, whether through regulation, through reputational cost, or through findings on open models that are embarrassing enough to force a response. But it’s a wedge. And wedges are how doors get opened.
The Independence Problem
There’s an uncomfortable thing I need to address. A fairness and ethics research lab funded by Omidyar money, studying systems that Omidyar’s portfolio companies use, has its own independence problem. A progressive-funded lab studying tech systems is never going to be neutral. I’m not going to pretend otherwise.
The FDA model is the wrong one here. Statutory authority, compelled disclosure, none of that exists for AI, and borrowing credibility from an institution you can’t replicate is just a dodge.
The Congressional Budget Office is a better analogy. The CBO is explicitly not neutral. Democrats think it’s too conservative, Republicans think it’s too liberal. But it’s credible because the methodology is transparent, the models are published, the analysis is reproducible, and the governance structure separates funding decisions from research direction. An independent interpretability lab needs the same things. Open data. Published code. Reproducible findings. A board structure that makes it hard for funders to kill work they don’t like.
The important thing to keep in mind is that the alternative to this proposal is not a perfectly neutral research ecosystem. Perfect neutrality doesn’t exist and it never has. The alternative is what we have now, which is basically an undisputed monopoly on technical infrastructure. One community holds all the tools. This lab would be explicitly ethics-native, the same way MIRI and Redwood and Apollo are explicitly safety-native. That’s fine. The goal isn’t fake neutrality. It’s ending the monopoly. A world where both communities can independently pull apart model internals, measure different things, and ask different questions, that’s just better. And “better than a monopoly” is a pretty low bar. We’re currently not clearing it.
The Window
Humanity AI’s research agenda is being set right now. The $500 million is real. RFPs haven’t dropped yet. An executive director hasn’t been hired. The grant categories that will determine what gets funded for the next five years aren’t locked in.
And this is the thing about grant categories. Once they exist, they’re really hard to change. Money flows into categories. Researchers align their proposals with categories. Institutions get built around categories. And then you’ve got a self-reinforcing system that was shaped by decisions made in 2026. If that $500 million goes entirely toward policy analysis, toward better communication between communities, toward all the things the current five focus areas describe, it will fund good things. But if none of it goes toward independent technical infrastructure, then the tool asymmetry just continues. And in five years the asymmetry is even more baked in, because now there are institutions and careers and grant pipelines built around it.
The ask is concrete. Carve out $15 to 20 million a year, 3 to 4% of the total commitment annually, and direct it toward independent interpretability labs, ethics-native evaluation frameworks, and a technical talent pipeline that doesn’t force researchers to choose between caring about justice and having the engineering skills to study it. And the symmetric thing should happen on the safety side too. Open Philanthropy funding ethics-informed evaluation work would be just as structurally useful. Both sides built their own silos. Both sides should be building doors.
Because right now the ethics community can measure where the Plinko disc lands. It can document that certain groups get worse outcomes, that certain benchmarks don’t measure what they claim, that certain prompts get refused inconsistently. What it can’t do is see the pegs. It can’t examine the internal structure that produces those outcomes, can’t build alternative structures, can’t independently verify whether the people who built the pegs are describing them accurately.
That’s what independent technical infrastructure changes. Not the ability to criticize from outside, they’ve already got that, but the ability to actually see the mechanism. To study it. To produce findings that are rigorous enough that they can’t be dismissed and specific enough that they point toward actual fixes.
The money exists. The talent is out there. The window is open, but it’s hard to say how long that stays true before the categories are set and the trajectories are locked... and trajectories, once they’re set, are expensive to change. Anyone who’s ever tried to pivot an established grant program knows this.
I don’t know if it’s going to happen. Probably not, honestly. But it could. The window is real, the cost is basically nothing compared to what’s already being spent, and the consequences of not doing it are just more of the same asymmetry we’ve been living with for years.

