A second opinion is useless if nobody saw the work.

Everyone loves this pitch because it sounds like a control.

Don’t trust one model. Put a second model on top of it. Have an Anthropic model draft and an OpenAI model review. Have Gemini review DeepSeek. Have one AI produce the work and another grade it.

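That answer feels right up until you ask one real question: what did the second model actually see? If the answer is only the finished output, the rest follows.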

That’s not review. It’s output inspection.

The pitch sounds tougher than it is

The appeal is obvious. If one model might miss something, bring in a second. If one model is the worker, make the other the reviewer. If they’re from different vendors, even better. Now it sounds like separation of duties. Now it sounds like oversight.

It isn’t.

Not because the models might secretly think alike. Not because the vendors may share failure patterns. Those are real arguments, but they are not this argument.

This argument is simpler and more damaging: output review is weaker than real review even if the reviewer is excellent, even if the models are different, even if the checker is stricter than the doer. Research on LLM-as-judge systems points the same way – model judges exhibit systematic position bias, where the verdict shifts with the order in which candidate answers are presented rather than with their underlying quality (‘Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge’, Shi et al., IJCNLP-AACL 2025). Even a strong judge looking at the output is still just looking at the output.

The failure is structural. A second model looking at a finished artifact is doing inspection. Real review is interrogation.

That is the whole post.

In real work, review is a conversation

Take engineering, because the pattern is clear there even if you’ve never worked in a software shop.

A pull request is not valuable because a second engineer stares at code and confirms it looks tidy. It is valuable because the reviewer can force the author to explain decisions. Why is this handled here instead of earlier? What happens when this call fails? Did you test the ugly case or only the happy path?

The code is the artifact under review. The decisions are the thing being reviewed. An empirical study of code review at Microsoft found exactly this: review derives much of its value from interactive explanation and surfacing alternatives, not from inspecting the code for defects (‘Characteristics of Useful Code Reviews: An Empirical Study at Microsoft’, Bosu, Greiler, and Bird, MSR 2015).

That distinction matters because serious mistakes are usually not visible as surface defects. They come from hidden assumptions, missing branches, bad abstractions, wrong constraints, or misunderstood requirements. A real reviewer is trying to expose those choices and pressure-test them.

The same pattern shows up elsewhere. In finance, research, operations, legal, policy, procurement, security, and planning, review means tracing assumptions, sources, alternatives, and version-to-version changes. The artifact matters. The decision trail matters more.

THE SHIFT

Review is not proofreading for plausibility. Review is cross-examining judgment.

The big miss: outputs don’t explain themselves

This is the mistake in model-on-model review. People treat the final artifact like it contains enough information to reconstruct the quality of the work that produced it.

Usually it doesn’t.

A memo does not tell you which source was weak but used anyway. A forecast does not tell you which alternative was dropped for the wrong reason. A code diff does not tell you which failure mode the author never thought about.

Those gaps are where the real work of review happens.

When a human reviewer asks how a number was derived, why an approach was chosen, or what alternatives were ruled out, they are probing the reasoning that produced the artifact. You cannot get that from surface inspection alone.

That is why so much bad work survives first-pass review. It looks fine. It reads cleanly. The formula balances. The code compiles. The chart is persuasive. The recommendation sounds sane.

Then someone asks one question about why, and the whole thing falls over.

A second AI can inspect. It can’t interrogate.

A second model can do useful things. It can spot inconsistencies, arithmetic mistakes, missing citations, and obvious contradictions.

Fine. Use it for that.

But do not confuse that with review.

A model that only sees the finished artifact sees the endpoint, not the process that produced it. It cannot see which assumptions were improvised, which branch was abandoned too early, or which source was chosen because it was convenient rather than correct.

Unless the reviewing model actually has access to a captured process that can be examined, challenged, and revised, any exchange between the two models is an imitation of review. The reviewer is still judging the artifact from the outside.

If all you provide is the output, then all it can do is output inspection.
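
To make that limit concrete, here is a minimal Python sketch of an output-only check. The forecast, its field names, and the `inspect_output` helper are all hypothetical, invented for illustration. The check can catch an arithmetic mismatch or a missing citation, but nothing in the artifact lets it see the assumption behind the numbers.

```python
def inspect_output(artifact: dict) -> list[str]:
    """Surface checks only: everything here is computed from the finished artifact."""
    problems = []
    line_total = sum(artifact["line_items"].values())
    if line_total != artifact["stated_total"]:
        problems.append(
            f"stated total {artifact['stated_total']} != sum of line items {line_total}"
        )
    if not artifact.get("citations"):
        problems.append("no citations listed")
    return problems


# Hypothetical forecast: the numbers add up and a source is cited,
# so inspection finds nothing to flag.
forecast = {
    "line_items": {"q1": 100, "q2": 110, "q3": 121},
    "stated_total": 331,
    "citations": ["internal sales report"],
}

print(inspect_output(forecast))  # prints []

# The 10% quarter-over-quarter growth baked into q2 and q3 is an assumption.
# It is recorded nowhere in the artifact, so no output-only check can question it.
```

The empty list is the point. The artifact is clean by every test the checker can run, and the one choice worth arguing about never reached it.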

This is why real reviewers are annoying

Think about the most valuable reviewer you know. Not the polite one. The annoying one. The one who keeps stopping the meeting.

That person is useful for a reason. They do not just inspect the artifact sitting in front of them. They keep pulling the discussion backward into the decisions that produced it. They want to know where the assumption came from, why the option set narrowed, what changed between versions, what failure mode is being ignored. They are not reviewing the surface. They are reopening the path.

A finished artifact is compressed work. Compression strips out information: false starts, abandoned branches, tradeoffs, unvoiced assumptions. Those are exactly the things a good reviewer wants to inspect.

That is why review in real organizations becomes interactive almost immediately. The first pass over the output is just the doorway. The actual work starts when the reviewer begins forcing hidden choices back into view. That is not a style preference. That is the mechanism.

More models don’t solve the wrong problem

Once you see the problem, a lot of popular fixes collapse immediately.

A committee of models doesn’t solve it. Three models looking at the same finished artifact are still three outsiders inspecting the endpoint. A stronger judge model doesn’t solve it. A more expensive judge model doesn’t solve it. Cross-vendor review doesn’t solve it. A ranker, verifier, scorer, or ensemble doesn’t solve it if all of them are still limited to the final output.

More inspection is still inspection.

That does not make these systems useless. It puts them in the right box. They are filters, rankers, and quality screens. They can improve selection, scoring, and bounded evaluation tasks.

Recent work from Microsoft Research on verifiers and reasoners is interesting because it tries to tighten the loop between generation and checking rather than pretending the endpoint alone is enough (‘Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers’, Microsoft Research, July 2025). Google’s work on rankers and judges is useful for mapping where model-based evaluation helps and where it breaks (‘Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation’, Google Research, 2025). And Google’s clarification work points in a more promising direction, because better questioning matters more than prettier grading (‘Learning to clarify: Multi-turn conversations with Action-Based Contrastive Self-Training’, Google Research, June 3, 2025).

But none of that changes the central point: if the reviewer only gets the artifact, the reviewer is working with lossy evidence, and lossy evidence is a weak basis for review.

What would count as actual oversight

If you want something closer to real review, stop obsessing over which model is the reviewer and start asking what the reviewer can interrogate.

THE FRAME

Reviewable evidence is not just the final answer. It is the record of inputs, assumptions, alternatives, and points of uncertainty that shaped that answer.

A serious reviewer needs to see what inputs were used, what assumptions got introduced, what alternatives were considered and discarded, what uncertainty showed up, and where the system had to guess. Those are the real objects of review. Those are the handles a control can actually grab.
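
One way to picture those handles is as a record the producing system writes alongside its answer. The sketch below is a hypothetical shape, not a standard: `DecisionRecord`, `Alternative`, `review_questions`, and every field name are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alternative:
    option: str
    reason_discarded: str  # empty string means no reason was recorded


@dataclass
class DecisionRecord:
    """Hypothetical record of how an answer was produced, not just the answer."""
    answer: str
    inputs: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    alternatives: list[Alternative] = field(default_factory=list)
    uncertainties: list[str] = field(default_factory=list)  # includes outright guesses


def review_questions(record: DecisionRecord) -> list[str]:
    """Turn the record into questions a reviewer can press on."""
    questions = []
    if not record.inputs:
        questions.append("No inputs recorded: what was this answer based on?")
    for assumption in record.assumptions:
        questions.append(f"Where did this assumption come from: {assumption}?")
    for alt in record.alternatives:
        if not alt.reason_discarded:
            questions.append(f"Why was '{alt.option}' dropped? No reason recorded.")
    for item in record.uncertainties:
        questions.append(f"What would resolve this uncertainty: {item}?")
    return questions


# Hypothetical procurement decision used only to exercise the sketch.
record = DecisionRecord(
    answer="Pick vendor A",
    inputs=["vendor A quote", "vendor B quote"],
    assumptions=["usage stays flat next year"],
    alternatives=[Alternative("vendor B", reason_discarded="")],
    uncertainties=["vendor A's support terms were not confirmed"],
)
for question in review_questions(record):
    print(question)
```

None of these questions can be generated from the string "Pick vendor A" alone. They exist only because the record kept what the finished answer would have compressed away.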

The future here is not one model writing something and another model nodding at the output with a clipboard. The only version that deserves to be called oversight is one where the process can be inspected and cross-examined, because the point of review is not to admire the answer. It is to challenge how the answer happened. If your reviewer cannot get access to that, you do not have oversight.

You have output cosplay.