Evaluation

Your AI didn't go down. It got worse.

When Anthropic's status page showed elevated errors on June 16, the visible failures were the easy part. The dangerous part was the requests that succeeded.

June 17, 2026 · 7 min read

A second opinion is useless if nobody saw the work.

Having one model produce the work and another inspect the result is not review. Review means seeing the work happen, not just judging what survived.

May 17, 2026 · 7 min read

Prompt engineering is a credibility tell. Not a discipline.

Engineering assumes a target that holds still. Frontier language models do not. The fastest way to identify someone who has not run an AI system in production is to watch them say 'prompt engineering' anyway.

April 28, 2026 · 6 min read

AI writing its own tests is theatre.

Different agents in a workflow don't create independence when they share the weights. The architecture is real engineering. It's just not the check your slide says it is.

April 19, 2026 · 5 min read

Browse by topic

Model Behavior · Agents · AI Hype · RLHF · Compute · ICL · Next-Token Prediction · Prompt Engineering · Sycophancy