
Your AI didn't go down. It got worse.

When Anthropic's status page showed elevated errors on June 16, the visible failures were the easy part. The dangerous part was the requests that succeeded.

On June 16, our Claude Code session lost its mind.

Not in the way you’d notice. There was no outage banner. No error in the terminal. No “service unavailable” message. Claude kept working. It read files, proposed edits, wrote code, explained itself. It sounded like Claude.

Except the decisions were wrong.

We run Claude Code sessions with an adversarial reviewer attached – a second AI watching every tool call, every file write, every decision before it executes. Normally, the reviewer catches laziness: skipped file reads, shallow fixes, shortcuts the agent takes because the cheap path looks close enough.

This was different. Claude was making genuine logical errors – and the tell wasn’t in the output. It was in the process. The agent was reading files that had no bearing on the task, ignoring results it had just received, proposing edits that contradicted its own observations from two steps earlier. The reviewer would block a bad move, explain why it was wrong, and Claude would acknowledge the correction – then take the same wrong branch again. Five corrections deep on something that normally takes zero.

The moves were legal. The game was lost.

Claude wasn’t down. It was up and confidently broken.

The status page tells you one thing. The work tells you another.

That same day, Anthropic’s status page recorded multiple incidents (‘Claude.ai Status History’, Anthropic). Elevated errors across all Sonnet and Opus models. Claude Code explicitly listed as affected. Error rates around 10% at peak. Most requests still returned something. The service was not simply offline.

We cannot prove from the public incident report that this caused our session’s behavior. The report says elevated errors, not degraded reasoning. But that distinction is exactly the problem. In LLM systems, the visible failure is the request that errors out. The dangerous failure is the request that succeeds while the system serving it is sick.

JUNE 16, 2026
Anthropic reported elevated errors affecting Sonnet, Opus, and Claude Code. The incident report measures availability, not reasoning quality.

Success is not binary

Many conventional APIs make success look binary. The payment processed or it didn’t. The record saved or it didn’t.

An LLM request has at least three layers of success. Transport: did the HTTP request complete? Protocol: did the response have the right structure – valid message, valid tool call? Semantic: was the answer actually good?
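As a rough sketch of how those three layers separate in practice, the snippet below grades one response at each layer. Everything named here is an assumption made for illustration: the `classify` helper, the `LayeredResult` type, and the response schema (a JSON object with a `content` list) are hypothetical, and the semantic check is left as a caller-supplied function because no generic version of it exists.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LayeredResult:
    transport_ok: bool            # did the HTTP request complete with a 2xx status?
    protocol_ok: bool             # did the body parse into the expected structure?
    semantic_ok: Optional[bool]   # was the answer good? None means nobody checked


def classify(status_code: int, body: str,
             semantic_check: Optional[Callable[[dict], bool]] = None) -> LayeredResult:
    """Grade one response at each layer; a later layer is only checked if the earlier one passed."""
    transport_ok = 200 <= status_code < 300

    protocol_ok = False
    parsed = None
    if transport_ok:
        try:
            parsed = json.loads(body)
            # Hypothetical schema: a JSON object carrying a "content" list.
            protocol_ok = isinstance(parsed, dict) and isinstance(parsed.get("content"), list)
        except json.JSONDecodeError:
            protocol_ok = False

    semantic_ok = None
    if protocol_ok and semantic_check is not None:
        semantic_ok = semantic_check(parsed)

    return LayeredResult(transport_ok, protocol_ok, semantic_ok)
```

With no `semantic_check` supplied, a well-formed 200 response comes back as `LayeredResult(True, True, None)`. That `None` is the point: the first two layers passed and the third was never examined.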

Status pages are built for the first layer. Product telemetry may catch parts of the second. The third – whether the reasoning held, whether the code does what the agent thinks it does, whether the tool calls followed from the evidence – is the one that matters. And it is the one that status pages and incident reports do not measure.

THE GAP
The request can succeed while the reasoning fails.

What a partial outage does to the requests that “work”

A 10% error rate does not mean 10% of requests fell into a hole while the other 90% ran on a pristine system. In a distributed inference service, visible errors are the edge of a larger stress field. They are the smoke alarm. The fire is the underlying instability.

When serving nodes fail, load balancers route around them. That sounds like a fix. It is also a new problem: the remaining capacity absorbs traffic it was not sized to handle all at once. Queues grow. Retries pile on. Cache locality breaks. Requests that would normally land on a warm, well-provisioned worker get squeezed through an overloaded fallback path.
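A back-of-envelope sketch shows why rerouting is not free. It assumes traffic is spread evenly across nodes and that each node behaves like a textbook M/M/1 queue. Both are simplifications chosen for illustration, not a model of Anthropic's infrastructure, and the 0.60 baseline utilization is an invented number.

```python
def utilization_after_failures(baseline_utilization: float, failed_fraction: float) -> float:
    """Per-node utilization once the surviving nodes absorb the failed nodes' traffic."""
    if not 0 <= failed_fraction < 1:
        raise ValueError("failed_fraction must be in [0, 1)")
    return baseline_utilization / (1.0 - failed_fraction)


def relative_time_in_system(utilization: float) -> float:
    """Mean time in an M/M/1 queue, in multiples of the bare service time: 1 / (1 - rho)."""
    if utilization >= 1.0:
        return float("inf")  # demand exceeds capacity; the queue grows without bound
    return 1.0 / (1.0 - utilization)


# Illustrative numbers only: nodes running at 60% utilization before the incident.
for failed in (0.0, 0.1, 0.2, 0.3):
    rho = utilization_after_failures(0.60, failed)
    print(f"{failed:.0%} of nodes down: utilization {rho:.2f}, "
          f"time in system {relative_time_in_system(rho):.1f}x service time")
```

Under these assumptions, losing 30% of nodes lifts utilization from 0.60 to roughly 0.86, and the modeled time in system goes from 2.5 to about 7 times the bare service time. The relationship is nonlinear: each additional lost node costs more than the one before it.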

None of these effects has to change the model weights. They change the conditions under which the model is served. A rerouted request may land in a more crowded batch. A long-context coding session may lose cache locality and be recomputed under a tighter deadline. A fallback path may preserve availability while quietly altering context handling or generation budget. The response still completes. But the agent has less effective room to think.

A 200 OK from a stressed LLM inference stack is not the same as a 200 OK from a healthy one.

THE TRAP
Availability preserves the interface. It does not guarantee the conditions that make reasoning reliable.

Anthropic has documented exactly this class of failure

In September 2025, Anthropic published a postmortem describing three infrastructure bugs that intermittently degraded Claude’s response quality (‘A postmortem of three recent issues’, Anthropic Engineering, September 2025). A context-window routing error sent requests to the wrong server pool – at its worst, affecting 16% of Sonnet requests and roughly 30% of Claude Code users in the affected window. A runtime optimization introduced output corruption. A compiler bug caused inconsistent token selection. None of these required the model weights to change. All of them produced degraded output from a system that looked operational. The failures were intermittent: the system did not fail closed; it sometimes produced abnormal responses from ordinary-looking requests.

Anthropic’s own evaluations did not initially catch the degradation. User reports did.

In April 2026, they published a Claude Code quality postmortem identifying three product-layer changes that degraded coding quality while the API remained healthy (‘Claude Code quality postmortem’, Anthropic Engineering, April 2026). One was a caching bug that silently stripped prior reasoning from the session after an idle threshold. Claude kept operating. It kept making tool calls. But the continuity of reasoning was gone. Users saw forgetfulness, repetition, and increasingly erratic behavior – the exact pattern of an agent that keeps working while the work keeps getting worse.

For a coding agent, degradation is damage

A degraded chat answer is annoying. A degraded agent action is damage.

Claude Code is not a single question and answer. It is a loop: read a file, form a plan, choose a tool, execute, interpret the result, revise, continue. Each step becomes context for the next. If step three is wrong, steps four through ten are contaminated. The agent doesn’t pause to reconsider. It keeps moving, because moving is what agents do.
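To put a rough number on that contamination, assume (purely for illustration) that each step is independently correct with some fixed probability, and that one wrong step spoils everything after it. The chance of a clean run is then the per-step accuracy raised to the number of steps. The accuracies below are invented, not measurements of any model.

```python
def clean_run_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps is correct."""
    return per_step_accuracy ** steps


# Illustrative accuracies only; real agent steps are not independent.
for accuracy in (0.99, 0.95, 0.90):
    chance = clean_run_probability(accuracy, 10)
    print(f"per-step accuracy {accuracy:.2f}: clean 10-step run with probability {chance:.2f}")
```

Under that assumption, a drop in per-step accuracy from 0.99 to 0.90 takes a ten-step task from about a 90% chance of a clean run to about 35%. A small degradation per step becomes a large one per task.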

And here is why process matters more than output: before the code is obviously broken, the work pattern gets strange. The agent opens files with no bearing on the task. It reruns commands whose results it already has. It proposes patches that don’t follow from the failing test. It accepts a correction, then takes the same wrong branch again. Every tool call is valid. Every edit is syntactically legal. But the trajectory is nonsense.
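Some of those symptoms can be checked mechanically. The sketch below is a deliberately crude heuristic over a log of tool calls, not the reviewer described in this article. The `ToolCall` shape, the tool names, and the caller-supplied `relevant_paths` and `blocked` sets are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ToolCall:
    tool: str     # hypothetical tool names: "read_file", "run_command", "edit_file"
    target: str   # file path or command string


def trajectory_flags(calls: List[ToolCall], relevant_paths: Set[str],
                     blocked: Set[ToolCall]) -> List[str]:
    """Flag process-level oddities in a sequence of individually valid tool calls."""
    flags = []
    seen_commands = set()
    for i, call in enumerate(calls):
        if call.tool == "read_file" and call.target not in relevant_paths:
            flags.append(f"step {i}: read of file unrelated to the task: {call.target}")
        if call.tool == "run_command":
            if call.target in seen_commands:
                flags.append(f"step {i}: reran a command whose result is already known: {call.target}")
            seen_commands.add(call.target)
        if call.tool == "edit_file":
            # An edit can change command results, so earlier runs no longer count as known.
            seen_commands.clear()
        if call in blocked:
            flags.append(f"step {i}: repeated an action that was already blocked")
    return flags
```

Each flagged call would pass a syntax or schema check on its own. The signal only appears when the calls are read as a sequence, which is why output-only review tends to see it late.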

By the time the final output looks suspicious, the process has already been wrong for ten minutes.

The first visible symptom of agent failure is often not bad code. It is a bad trajectory.

The observability gap

Infrastructure monitoring asks: did the request complete? Process review asks: does the work make sense?

These are fundamentally different questions. A status page can tell you whether tokens were served. It cannot tell you whether the agent is reading the right files, whether its tool calls follow from its observations, whether it incorporated a correction or just acknowledged it, or whether the patch it is about to write contradicts the test failure it just saw.

That gap is where damage hides.

The dangerous part is the part that works

We caught it because we had a reviewer watching the agent’s process in real time – not grading the output after the fact, but watching which files it opened, which tools it called, and whether those decisions tracked with the evidence in the session. The reviewer blocked the bad tool calls before they wrote broken code to disk. It flagged the trajectory going wrong before the output existed.
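The structural idea, review before execution rather than after, can be sketched in a few lines. This is not our reviewer, which is a second AI. Here the `review` callable is a stand-in for it, and `ReviewGate`, `must_read_before_write`, and the tool names are hypothetical.

```python
from typing import Callable, List, Tuple

Verdict = Tuple[bool, str]  # (allowed, reason)


class ReviewGate:
    """Runs a review function before every tool call and refuses to execute blocked calls."""

    def __init__(self, review: Callable[[str, dict, List[str]], Verdict]):
        self._review = review
        self.history: List[str] = []  # session log the reviewer can inspect

    def execute(self, tool_name: str, args: dict, tool: Callable[..., str]) -> str:
        allowed, reason = self._review(tool_name, args, list(self.history))
        if not allowed:
            note = f"BLOCKED {tool_name}: {reason}"
            self.history.append(note)
            return note  # the tool never runs, so nothing reaches disk
        result = tool(**args)
        self.history.append(f"{tool_name}({args}) -> {result}")
        return result


def must_read_before_write(tool_name: str, args: dict, history: List[str]) -> Verdict:
    """Toy rule standing in for a real reviewer: no writing a file the session never read."""
    if tool_name == "write_file" and not any(
            entry.startswith("read_file") and args["path"] in entry for entry in history):
        return False, f"{args['path']} was never read in this session"
    return True, "ok"
```

A caller would wrap each tool invocation as `gate.execute("write_file", {"path": ..., "text": ...}, write_file)`. The design choice that matters is ordering: the verdict is computed from the session history before the side effect happens, so a bad decision costs a blocked call instead of a broken file.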

Most people don’t have that.

Most people have Claude Code writing files, and a developer reviewing the result at the end – if they review it at all. During a partial outage, that developer sees the same confident output they always see. The terminal fills up. Files get written. The agent says “done.” Nothing in that experience signals that the system was serving degraded reasoning underneath valid-looking responses.

THE RISK
The dangerous requests are not the ones that error. They are the ones that return confidently enough to be trusted.