Plausibly Wrong
FIELD NOTES

AI writing its own tests is theatre

Different agents in a workflow don't create independence when they share the weights. The architectural show is real engineering; it's just not the check your slide claims.

An org. chart slide is up on the screen in your AI strategy deck. Three boxes, clean arrows, comforting symmetry: “Code Agent” → “Testing Subagent” → “Review Agent.” Somebody says “defense in depth,” and the room nods because it looks like the software org. charts you grew up trusting.

But you’ve seen what happens next. The code ships with a bug that the tests “proved” couldn’t exist. The review agent writes a confident paragraph about why the tests are sufficient. And when your team digs in, you find the same blind spot echoed three times, dressed up as process.

That slide is expensive wallpaper.

What the slide is hiding

The structural claim is simple: those boxes aren’t independent actors. They’re usually the same model family, trained on the same Internet, with the same inductive biases and the same blind spots. You can separate context windows and permissions all day long; you’re still asking one brain to grade its own homework.

The best evidence we have is now annoyingly direct. The ICML 2025 paper on correlated errors across 350+ LLMs (“algorithmic monoculture”) found that when two models are wrong on one leaderboard, they agree with each other 60% of the time. Worse for the “just buy the best model” story, the paper reports that larger, more accurate models have highly correlated errors, even across distinct architectures and providers.

Correlation goes up with quality. Your “stronger” stack doesn’t diversify risk, it concentrates it.
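The agreement statistic behind that claim is easy to compute for yourself. Here is a minimal sketch on made-up data (not the paper's dataset or methodology): given two models' answers to the same questions, measure how often they agree, conditional on both being wrong.

```python
def error_agreement(answers_a, answers_b, truth):
    """Of the items both models get wrong, what fraction do they get
    wrong in the *same* way? Independent guessers would rarely agree;
    correlated models agree far more often."""
    both_wrong = [
        (a, b)
        for a, b, t in zip(answers_a, answers_b, truth)
        if a != t and b != t
    ]
    if not both_wrong:
        return 0.0
    agree = sum(1 for a, b in both_wrong if a == b)
    return agree / len(both_wrong)

# Toy illustration (invented answers): both models miss items 2-4,
# and on two of those three misses they give the identical wrong answer.
truth   = ["A", "B", "C", "D", "A"]
model_1 = ["A", "C", "D", "B", "A"]
model_2 = ["A", "C", "D", "C", "A"]
print(error_agreement(model_1, model_2, truth))  # 2 of 3 shared misses
```

Two fair coins would sit near chance on a metric like this; the paper's point is that frontier models sit far above it.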

Independence isn’t an org. chart property, it’s a weight property.

So when your slide implies “the tester checks the coder,” it’s selling a property the system can’t produce. The handoff doesn’t break the correlation; it just moves it into a new context window.

“But specialization” (yes, it’s real work)

You’re not crazy to want specialization. The vendors aren’t faking it either. Multi-agent architectures are legitimate engineering. They’re paying off in throughput, coverage, and cost.

Anthropic’s own Claude Code subagents documentation describes subagents as “specialized AI assistants” with “their own context window,” custom system prompts, specific tool access, and independent permissions. That matters. It’s how you keep a test-writing agent from being distracted by a sprawling coding transcript, and how you keep a reviewer from quietly “helping” by rewriting the solution.

But notice what that independence is: context separation and permissioning. It’s not weight independence. Under the hood, it’s still Claude.
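In code, that kind of "independence" looks like this. A deliberately stripped-down sketch (the `model` function is a stand-in, not any vendor's API): two agents with separate histories, prompts, and tool permissions, both routed into the same underlying function.

```python
def model(system_prompt, history, message):
    """Stand-in for one shared set of weights. Both 'agents' below
    call this same function: separate contexts, same brain."""
    # A real call would hit one provider's endpoint here; the point
    # is that it's the *same* callee for coder and tester alike.
    return f"[{system_prompt}] response to: {message}"

class Subagent:
    """Context separation: own system prompt, own history, own tools.
    What it does NOT have is its own weights."""
    def __init__(self, system_prompt, tools=()):
        self.system_prompt = system_prompt
        self.tools = tools    # permissioning lives here
        self.history = []     # isolation lives here

    def run(self, message):
        reply = model(self.system_prompt, self.history, message)
        self.history.append((message, reply))
        return reply

coder = Subagent("You write code.", tools=("edit",))
tester = Subagent("You write tests.", tools=("run_tests",))
# Different boxes on the slide, same `model` underneath.
code = coder.run("implement parse_date()")
tests = tester.run(f"write tests for: {code}")
```

Everything the architecture diagram promises is real in this sketch except the one thing the arrows imply: there is still exactly one `model`.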

Google DeepMind says the quiet part out loud too. In the CodeMender launch, they describe how they built “special-purpose agents” for different aspects of code security. It’s a smart design move. And it’s still Gemini Deep Think end-to-end, which means the same family of priors is doing the proposing and the checking.

OpenAI’s April 2026 update on the Agents SDK and “sandbox-aware orchestration” is the same story in platform form: route subagents into isolated environments, give them tools, lock down permissions, make the orchestration robust.

So don’t dismiss specialization. Splitting the work (one agent writes code, another writes tests, another reviews) usually produces more coverage and fewer obvious misses. It’s not that multi-agent is fake; it’s that it’s being sold as independence when it’s actually coordination.

THE FRAME
Specialization buys you breadth. It doesn’t buy you disagreement.

“Then we’ll use a different vendor”

This is the escape hatch because it sounds like classic risk management. Claude writes the code, Gemini writes the tests, OpenAI reviews the whole thing. Now you’ve got diversity. Now you’ve got checks and balances.

You do get something. Cross-vendor setups can reduce single-model quirks, and they can catch some implementation mistakes that one model family tends to gloss over. You’re narrowing the gap.

But you’re not closing it, and the correlated-errors evidence is the reason. The Correlated Errors paper explicitly finds that errors can be highly correlated even across distinct architectures and providers. The authors tie it to “algorithmic monoculture,” which is a polite way of saying: the training data, benchmarks, and optimization targets are converging. Models are different brands of the same stuff.

In practice, that shows up as the same kind of confident wrongness. The same missing edge case because it’s rare in the corpus. The same over-trust in a plausible API behavior. The same tendency to write tests that mirror the implementation, because that’s what “tests” look like in the data they were all trained on.
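That "tests that mirror the implementation" failure mode is concrete enough to show. A hypothetical sketch (the `days_in_month` function and its bug are invented for illustration): a mirror test restates the code back at itself and blesses the bug, while a spec-derived test catches it.

```python
def days_in_month(year, month):
    # Model-written code with a plausible bug: leap years are
    # ignored entirely, so February is always 28 days.
    return {2: 28, 4: 30, 6: 30, 9: 30, 11: 30}.get(month, 31)

# Mirror test: derived by reading the implementation and asserting
# what it already does. It encodes the bug as the expected value,
# so it passes forever.
assert days_in_month(2024, 2) == 28

# Spec-derived test: written from the calendar, not the code.
# 2024 is a leap year, so February should have 29 days.
def leap_check():
    return days_in_month(2024, 2) == 29

print(leap_check())  # False: the implementation is wrong
```

The mirror test isn't lazy; it's what "a test for this function" statistically looks like when your notion of a test comes from the same corpus that produced the function.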

So yes, use a different vendor if you can afford it. Just don’t let procurement theater become your definition of assurance. If “multi-vendor” is what you’re pointing at when someone asks how you catch the model being wrong, you’re setting yourself up for an ugly surprise.

What actually produces independence

Independence comes from something that can afford to disagree. Not socially, structurally.

If you take code a model wrote and hand it to another model with “review this, does it look right?”, you already know the shape of the answer you’ll get: “yes,” plus a handful of minor suggestions. That’s not because you picked the wrong second model. It’s because assistant training rewards helpful-sounding agreement by default, and that bias is broad. Anthropic’s work on sycophancy in language models (Sharma et al., 2023) shows multiple state-of-the-art assistants reliably skewing toward telling you what you want to hear; even humans and preference models sometimes prefer convincingly written agreement over correctness.

So the mechanism that matters isn’t the vendor swap, it’s adversarial framing. The second pass needs a job that pays for disagreement: a role that’s explicitly trying to break the code, an objective that treats “looks fine” as a failure mode, and a structure that forces the model to take and defend a skeptical position. That’s the difference between asking a colleague for a quick look and putting the code in an interrogation room.
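One cheap way to make the structure pay for disagreement, sketched below (the brief text and the `findings` format are illustrative assumptions, not any product's API): give the second pass a breaker's mandate, and reject any review that comes back empty-handed.

```python
ADVERSARIAL_BRIEF = """You are trying to BREAK this code.
Assume it is wrong somewhere. Your output is a list of concrete
failure hypotheses: an input, the expected behavior, and the
suspected actual behavior. 'Looks fine' is a failed review."""

def accept_review(findings):
    """Treat an empty or hand-wavy review as a failure mode, not a
    pass. A finding counts only if it names an input and a suspected
    wrong behavior."""
    concrete = [
        f for f in findings
        if f.get("input") is not None and f.get("suspected")
    ]
    # No concrete attack attempted -> the review didn't do its job.
    return len(concrete) > 0

# A sycophantic pass produces nothing actionable and is rejected:
print(accept_review([{"summary": "LGTM, minor nits"}]))  # False
print(accept_review([{"input": "''",
                      "suspected": "IndexError on empty string"}]))  # True
```

The gate doesn't make the model smarter; it just removes "agree politely" from the set of outputs that count as finishing the job.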

And humans still matter, but only if you structure the job so they can disagree too. If your reviewers are measured on throughput, they’ll rubber-stamp. If they’re measured on reliability, they’ll slow down and fight the model when it’s being slippery. (You’ve seen the difference between “review for velocity” and “review for blame.” Only one of those produces real scrutiny.)

KEY POINT
If your “test” can’t say no to the model, it’s not a test, it’s a second opinion from the same brain.

That’s the prescription your team can act on: keep the agents, keep the specialization, but make sure at least one arrow in the workflow points to something that doesn’t share the weights.
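The clearest example of an arrow that leaves the cloud is a check with no learned priors at all. A sketch in plain Python (the `sort_under_test` function is a placeholder for whatever the code agent produced): properties derived from the spec, exercised on generated inputs, verified by a program that shares no weights with anything.

```python
import random

def sort_under_test(xs):
    # Placeholder for model-written code; swap in the real thing.
    return sorted(xs)

def check_sort(impl, trials=1000, seed=0):
    """Property checks derived from the *spec*, not from what tests
    'usually look like' in a training corpus. This verifier can't be
    sycophantic; it has no opinion to skew."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        out = impl(xs)
        # Property 1: output is ordered.
        assert all(a <= b for a, b in zip(out, out[1:])), (xs, out)
        # Property 2: output is a permutation of the input.
        assert sorted(xs) == sorted(out), (xs, out)
    return True

print(check_sort(sort_under_test))  # True
```

Property-based testing libraries such as Hypothesis industrialize this pattern, but even the hand-rolled version has the one trait the agent boxes can't offer: it says no for structural reasons, not social ones.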

Redraw the slide

Go back to that org. chart slide. Don’t delete it; fix it.

Keep your three boxes if they help people reason about work: Code Agent, Testing Subagent, Review Agent. Those are real roles, and the vendor tooling is getting better at making them operational. Anthropic’s subagent permissioning is useful. OpenAI’s sandbox-aware orchestration is useful. DeepMind’s special-purpose agents are useful. You should use them for what they are: coverage, thoroughness, speed.

But the truthful slide has a new arrow. It points out of the Claude/OpenAI/Gemini cloud to something that can disagree without sharing the same priors: an adversarial second pass whose job is to break the work, or a human reviewer whose incentives are aligned with reliability instead of throughput.

That’s the version you can hold with a straight face. The agents are doing work. They’re just not checking each other in the way the boxes imply. Redraw it that way, and the wallpaper turns back into architecture.