Plausibly Wrong
FIELD NOTES

AI writing its own tests is theatre

Different agents in a workflow don't create independence when they share the weights. The architectural show is real engineering; it's just not the check your slide claims.

An org. chart slide is up on the screen in your AI strategy deck. Three boxes, clean arrows, comforting symmetry: “Code Agent” → “Testing Subagent” → “Review Agent.” Somebody says “defense in depth,” and the room nods because it looks like the software org. charts you grew up trusting.

But you’ve seen what happens next. The code ships with a bug that the tests “proved” couldn’t exist. The review agent writes a confident paragraph about why the tests are sufficient. And when your team digs in, you find the same blind spot echoed three times, dressed up as process.

That slide is expensive wallpaper.

What the slide is hiding

The structural claim is simple: those boxes aren’t independent actors. They’re usually the same model family, trained on the same Internet, with the same inductive biases and the same blind spots. You can separate context windows and permissions all day long; you’re still asking one brain to grade its own homework.

The best evidence we have is now annoyingly direct. The ICML 2025 paper on correlated errors across 350+ LLMs (“algorithmic monoculture”) found that when two models are wrong on one leaderboard, they agree with each other 60% of the time. Worse for the “just buy the best model” story, the paper reports that larger, more accurate models have highly correlated errors, even across distinct architectures and providers.

Correlation goes up with quality. Your “stronger” stack doesn’t diversify risk, it concentrates it.
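The agreement statistic behind that claim is easy to compute for yourself. Here is a minimal sketch on made-up data (not the paper's dataset or methodology): given two models' answers to the same questions, measure how often they agree, conditional on both being wrong.

```python
def error_agreement(answers_a, answers_b, truth):
    """Of the items both models get wrong, what fraction do they get
    wrong in the *same* way? Independent guessers would rarely agree;
    correlated models agree far more often."""
    both_wrong = [
        (a, b)
        for a, b, t in zip(answers_a, answers_b, truth)
        if a != t and b != t
    ]
    if not both_wrong:
        return 0.0
    agree = sum(1 for a, b in both_wrong if a == b)
    return agree / len(both_wrong)

# Toy illustration (invented answers): both models miss items 2-4,
# and on two of those three misses they give the identical wrong answer.
truth   = ["A", "B", "C", "D", "A"]
model_1 = ["A", "C", "D", "B", "A"]
model_2 = ["A", "C", "D", "C", "A"]
print(error_agreement(model_1, model_2, truth))  # 2 of 3 shared misses
```

Two fair coins would sit near chance on a metric like this; the paper's point is that frontier models sit far above it.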

Independence isn’t an org. chart property, it’s a weight property.

So when your slide implies “the tester checks the coder,” it’s selling a property the system can’t produce. The handoff doesn’t break the correlation; it just moves it into a new context window.

“But specialization” (yes, it’s real work)

You’re not crazy to want specialization. The vendors aren’t faking it either. Multi-agent architectures are legitimate engineering. They’re paying off in throughput, coverage, and cost.

Anthropic’s own Claude Code subagents documentation describes subagents as “specialized AI assistants” with “their own context window,” custom system prompts, specific tool access, and independent permissions. That matters. It’s how you keep a test-writing agent from being distracted by a sprawling coding transcript, and how you keep a reviewer from quietly “helping” by rewriting the solution.

But notice what that independence is: context separation and permissioning. It’s not weight independence. Under the hood, it’s still Claude.
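In code, that kind of "independence" looks like this. A deliberately stripped-down sketch (the `model` function is a stand-in, not any vendor's API): two agents with separate histories, prompts, and tool permissions, both routed into the same underlying function.

```python
def model(system_prompt, history, message):
    """Stand-in for one shared set of weights. Both 'agents' below
    call this same function: separate contexts, same brain."""
    # A real call would hit one provider's endpoint here; the point
    # is that it's the *same* callee for coder and tester alike.
    return f"[{system_prompt}] response to: {message}"

class Subagent:
    """Context separation: own system prompt, own history, own tools.
    What it does NOT have is its own weights."""
    def __init__(self, system_prompt, tools=()):
        self.system_prompt = system_prompt
        self.tools = tools    # permissioning lives here
        self.history = []     # isolation lives here

    def run(self, message):
        reply = model(self.system_prompt, self.history, message)
        self.history.append((message, reply))
        return reply

coder = Subagent("You write code.", tools=("edit",))
tester = Subagent("You write tests.", tools=("run_tests",))
# Different boxes on the slide, same `model` underneath.
code = coder.run("implement parse_date()")
tests = tester.run(f"write tests for: {code}")
```

Everything the architecture diagram promises is real in this sketch except the one thing the arrows imply: there is still exactly one `model`.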

Google DeepMind says the quiet part out loud too. In the CodeMender launch, they describe how they built “special-purpose agents” for different aspects of code security. It’s a smart design move. And it’s still Gemini Deep Think end-to-end, which means the same family of priors is doing the proposing and the checking.

OpenAI’s April 2026 update on the Agents SDK and “sandbox-aware orchestration” is the same story in platform form: route subagents into isolated environments, give them tools, lock down permissions, make the orchestration robust.

So don’t dismiss specialization. Splitting the work (one agent writes code, another writes tests, another reviews) usually produces more coverage and fewer obvious misses. It’s not that multi-agent is fake; it’s that it’s being sold as independence when it’s actually coordination.

THE FRAME
Specialization buys you breadth. It doesn’t buy you disagreement.

“Then we’ll use a different vendor”

This is the escape hatch because it sounds like classic risk management. Claude writes the code, Gemini writes the tests, OpenAI reviews the whole thing. Now you’ve got diversity. Now you’ve got checks and balances.

You do get something. Cross-vendor setups can reduce single-model quirks, and they can catch some implementation mistakes that one model family tends to gloss over. You’re narrowing the gap.

But you’re not closing it, and the correlated-errors evidence is the reason. The Correlated Errors paper explicitly finds that errors can be highly correlated even across distinct architectures and providers. The authors tie it to “algorithmic monoculture,” which is a polite way of saying: the training data, benchmarks, and optimization targets are converging. Models are different brands of the same stuff.

In practice, that shows up as the same kind of confident wrongness. The same missing edge case because it’s rare in the corpus. The same over-trust in a plausible API behavior. The same tendency to write tests that mirror the implementation, because that’s what “tests” look like in the data they were all trained on.
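That "tests that mirror the implementation" failure mode is concrete enough to show. A hypothetical sketch (the `days_in_month` function and its bug are invented for illustration): a mirror test restates the code back at itself and blesses the bug, while a spec-derived test catches it.

```python
def days_in_month(year, month):
    # Model-written code with a plausible bug: leap years are
    # ignored entirely, so February is always 28 days.
    return {2: 28, 4: 30, 6: 30, 9: 30, 11: 30}.get(month, 31)

# Mirror test: derived by reading the implementation and asserting
# what it already does. It encodes the bug as the expected value,
# so it passes forever.
assert days_in_month(2024, 2) == 28

# Spec-derived test: written from the calendar, not the code.
# 2024 is a leap year, so February should have 29 days.
def leap_check():
    return days_in_month(2024, 2) == 29

print(leap_check())  # False: the implementation is wrong
```

The mirror test isn't lazy; it's what "a test for this function" statistically looks like when your notion of a test comes from the same corpus that produced the function.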

So yes, use a different vendor if you can afford it. Just don’t let procurement theater become your definition of assurance. If “multi-vendor” is what you’re pointing at when someone asks how you catch the model being wrong, you’re setting yourself up for an ugly surprise.

What actually produces independence

Independence comes from something that can afford to disagree. Not socially, structurally.

If you take code a model wrote and hand it to another model with “review this, does it look right?”, you already know the shape of the answer you’ll get: “yes,” plus a handful of minor suggestions. That’s not because you picked the wrong second model. It’s because assistant training rewards helpful-sounding agreement by default, and that bias is broad. Anthropic’s work on sycophancy in language models (Sharma et al., 2023) shows multiple state-of-the-art assistants reliably skewing toward telling you what you want to hear; even humans and preference models sometimes prefer convincingly written agreement over correctness.

So the mechanism that matters isn’t the vendor swap, it’s adversarial framing. The second pass needs a job that pays for disagreement: a role that’s explicitly trying to break the code, an objective that treats “looks fine” as a failure mode, and a structure that forces the model to take and defend a skeptical position. That’s the difference between asking a colleague for a quick look and putting the code in an interrogation room.
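One cheap way to make the structure pay for disagreement, sketched below (the brief text and the `findings` format are illustrative assumptions, not any product's API): give the second pass a breaker's mandate, and reject any review that comes back empty-handed.

```python
ADVERSARIAL_BRIEF = """You are trying to BREAK this code.
Assume it is wrong somewhere. Your output is a list of concrete
failure hypotheses: an input, the expected behavior, and the
suspected actual behavior. 'Looks fine' is a failed review."""

def accept_review(findings):
    """Treat an empty or hand-wavy review as a failure mode, not a
    pass. A finding counts only if it names an input and a suspected
    wrong behavior."""
    concrete = [
        f for f in findings
        if f.get("input") is not None and f.get("suspected")
    ]
    # No concrete attack attempted -> the review didn't do its job.
    return len(concrete) > 0

# A sycophantic pass produces nothing actionable and is rejected:
print(accept_review([{"summary": "LGTM, minor nits"}]))  # False
print(accept_review([{"input": "''",
                      "suspected": "IndexError on empty string"}]))  # True
```

The gate doesn't make the model smarter; it just removes "agree politely" from the set of outputs that count as finishing the job.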

And humans still matter, but only if you structure the job so they can disagree too. If your reviewers are measured on throughput, they’ll rubber-stamp. If they’re measured on reliability, they’ll slow down and fight the model when it’s being slippery. (You’ve seen the difference between “review for velocity” and “review for blame.” Only one of those produces real scrutiny.)

KEY POINT
If your “test” can’t say no to the model, it’s not a test, it’s a second opinion from the same brain.

That’s the prescription your team can act on: keep the agents, keep the specialization, but make sure at least one arrow in the workflow points to something that doesn’t share the weights.
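The clearest example of an arrow that leaves the cloud is a check with no learned priors at all. A sketch in plain Python (the `sort_under_test` function is a placeholder for whatever the code agent produced): properties derived from the spec, exercised on generated inputs, verified by a program that shares no weights with anything.

```python
import random

def sort_under_test(xs):
    # Placeholder for model-written code; swap in the real thing.
    return sorted(xs)

def check_sort(impl, trials=1000, seed=0):
    """Property checks derived from the *spec*, not from what tests
    'usually look like' in a training corpus. This verifier can't be
    sycophantic; it has no opinion to skew."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        out = impl(xs)
        # Property 1: output is ordered.
        assert all(a <= b for a, b in zip(out, out[1:])), (xs, out)
        # Property 2: output is a permutation of the input.
        assert sorted(xs) == sorted(out), (xs, out)
    return True

print(check_sort(sort_under_test))  # True
```

Property-based testing libraries such as Hypothesis industrialize this pattern, but even the hand-rolled version has the one trait the agent boxes can't offer: it says no for structural reasons, not social ones.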

Redraw the slide

Go back to that org. chart slide. Don’t delete it; fix it.

Keep your three boxes if they help people reason about work: Code Agent, Testing Subagent, Review Agent. Those are real roles, and the vendor tooling is getting better at making them operational. Anthropic’s subagent permissioning is useful. OpenAI’s sandbox-aware orchestration is useful. DeepMind’s special-purpose agents are useful. You should use them for what they are: coverage, thoroughness, speed.

But the truthful slide has a new arrow. It points out of the Claude/OpenAI/Gemini cloud to something that can disagree without sharing the same priors: an adversarial second pass whose job is to break the work, or a human reviewer whose incentives are aligned with reliability instead of throughput.

That’s the version you can hold with a straight face. The agents are doing work. They’re just not checking each other in the way the boxes imply. Redraw it that way, and the wallpaper turns back into architecture.