
ICL: The AI capability you're paying for and not using.

Your model can generate a photorealistic Elon Musk in a Carmen Miranda fruit hat. Ask it to forecast revenue and you get the four-quarter average. The expensive part is In-Context Learning. Guess what it avoids.

Here is a thing that should bother anyone paying attention.

The same AI model your company has stuffed into seven workflows can, on demand, generate a photorealistic image of Elon Musk wearing a Carmen Miranda fruit hat, smiling on the floor of a Tesla factory, with cloth physics on the bananas. The same model can produce a thirty-minute video of Bob Ross breakdancing in his studio, complete with the gentle PBS voiceover narrating the choreography.

And when you ask it to forecast next quarter’s revenue, it averages the last four quarters and hands you a number. Pizza party. Slack confetti. QBR slide built.

That gap - between what the model can do and what it will do by default - is one of the most expensive misunderstandings in AI right now. The name for the thing that closes it is In-Context Learning, or ICL. Almost nobody explains it plainly, which is convenient for vendors and bad for everyone else.

What ICL actually is

You do not need the math. You need the mechanism.

When you prompt a modern model, two broad paths are available.

Path one, cheap: it grabs the most generic version of the task from pretraining and ships something plausible. Forecast? Fine, use a recent average. Contract review? Summarize boilerplate. Customer triage? Sort by obvious keywords.

Path two, ICL: it studies the examples and context you gave it, infers the pattern in your data, and builds a task-specific approach on the fly.

That second path is the useful trick. Nobody retrained the model on your business. No weights changed. No fine-tune ran overnight. You handed it examples in the prompt, and during the act of answering, it adapted to the task sitting in front of it.

That is ICL.

This is why examples matter so much. The model may know ten thousand generic ways to answer a question. ICL is how you get it to notice that your version is different.
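Mechanically, "handing it examples in the prompt" can be as plain as string assembly. Here is a minimal sketch, assuming a ticket-routing task: the `Example` class, the `build_few_shot_prompt` helper, and the sample tickets are all hypothetical, invented for illustration, and the sketch stops at building the text rather than calling any vendor API.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One worked example: a situation from your business and the answer you wanted."""
    situation: str
    answer: str


def build_few_shot_prompt(task: str, examples: list[Example], query: str) -> str:
    """Assemble a plain-text prompt that puts worked examples ahead of the new case."""
    parts = [task.strip(), ""]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Input: {ex.situation}")
        parts.append(f"Output: {ex.answer}")
        parts.append("")
    # Ask for the pattern in the examples explicitly instead of a generic answer.
    parts.append("Use the pattern in the examples above, not a generic heuristic.")
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical sample data, made up for this sketch.
examples = [
    Example("Ticket: 'Invoice total is wrong after plan change'", "billing-escalation"),
    Example("Ticket: 'Cannot log in since SSO rollout'", "identity-team"),
]
prompt = build_few_shot_prompt(
    task="Route each support ticket to the owning team.",
    examples=examples,
    query="Ticket: 'Charged twice after upgrading seats'",
)
print(prompt)
```

Nothing here touches the model's weights. The only thing that changes is the text the model reads before it answers, which is the whole point.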

Why the model keeps dodging it

Because ICL costs more.

Not more on the invoice in any clean way. More in internal compute. Actually reading examples, comparing them, inferring a pattern, and resisting the obvious shortcut is expensive relative to blurting out the first plausible answer.

And these systems are tuned for efficiency. “Efficiency” here means: do the cheapest thing that gets accepted.

If the user accepts the four-quarter average, why spend more cycles discovering the real structure in the data? From the model’s point of view, and the vendor’s, that is a win. Answer delivered. Capacity preserved. Margin protected.

This is the part polite marketing tends to skip.

THE TELL
Most of the time you are not seeing the model’s best behavior. You are seeing its cheapest defensible behavior.
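To see what the four-quarter average throws away, here is a small sketch with made-up numbers. The revenue series is hypothetical, and `seasonal_forecast` is just one simple stand-in for "noticing the structure" (same quarter last year, scaled by year-over-year growth). It is not a claim about what any model does internally.

```python
# Hypothetical quarterly revenue in $M, oldest first: two years, each with a Q4 spike.
revenue = [10.0, 11.0, 12.0, 20.0, 12.0, 13.2, 14.4, 24.0]


def four_quarter_average(history):
    """The cheap path: average the last four quarters."""
    return sum(history[-4:]) / 4


def seasonal_forecast(history, season=4):
    """Same quarter last year, scaled by year-over-year growth."""
    if len(history) < 2 * season:
        raise ValueError("need at least two full years of history")
    last_year = sum(history[-season:])
    prior_year = sum(history[-2 * season:-season])
    growth = last_year / prior_year
    # history[-season] is the same quarter one year before the one being forecast.
    return history[-season] * growth


print(f"four-quarter average: {four_quarter_average(revenue):.1f}")
print(f"seasonal forecast:    {seasonal_forecast(revenue):.1f}")
```

With these invented numbers, the average lands near 15.9 because last year's Q4 spike inflates it, while the seasonal version lands near 14.4 for the coming Q1. Both look defensible on a slide. Only one of them read the data.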

The compute backdrop

This stopped being theory when the infrastructure started showing through the paint.

SPRING 2026
Anthropic ran out of compute. Claude usage limits tightened in late March. On April 20, GitHub pulled Anthropic’s Opus models out of the standard Copilot Pro plan. The latest Opus is locked behind Copilot Pro+ and Enterprise only, and the older Opus versions are slated to leave Pro+ as well.

Anthropic’s annualized revenue run rate jumped from roughly $9 billion at the end of 2025 to north of $30 billion by April 2026 (‘Anthropic Tops $30 Billion Run Rate, Seals Broadcom Deal’, Bloomberg, April 6, 2026). On the Dwarkesh Patel podcast in February, CEO Dario Amodei described the industry as operating in “a compute-constrained world” (‘We are near the end of the exponential’, Dwarkesh Podcast, February 13, 2026). Demand outran GPU and inference capacity, and pricing alone did not fix it.

So the visible product got worse. On March 26, 2026, Anthropic tightened Claude session limits during peak hours for Free, Pro, and Max subscribers, with around seven percent of Pro users now expected to hit caps they previously wouldn’t have (‘Anthropic tweaks Claude usage limits to manage capacity’, The Register, March 26, 2026). Through March and April, quality complaints climbed alongside repeated outages on Claude.ai, the API, and Claude Code (‘Claude is getting worse, according to Claude’, The Register, April 13, 2026).

Then GitHub made it explicit. On April 20, 2026, Opus models were pulled from Copilot Pro outright. Opus 4.7 stayed available in Copilot Pro+ and Enterprise; the older 4.5 and 4.6 were slated for removal from Pro+ as well. New sign-ups for Pro, Pro+, and Student plans were paused. GitHub’s own explanation was that “agentic workflows have fundamentally changed Copilot’s compute demands. Long-running, parallelized sessions now regularly consume far more resources than the original plan structure was built to support” (‘Changes to GitHub Copilot Individual plans’, GitHub Blog, April 20, 2026).

Translated: scarce capacity went to the tiers and customers GitHub and Anthropic could best protect. The reseller bundle got squeezed.

That was the moment the market had to admit this is not just software. It is software sitting on top of finite physical infrastructure, and when the GPUs get tight, product promises get quietly edited.

Why thinking tokens don’t tell you what you think they tell you

There is another trap here.

When a model “thinks longer,” it emits more reasoning tokens before the answer. Vendors bill for them. Buyers see a bigger count and assume more thinking means more real work.

That is only half true.

Reasoning tokens generally do track compute. In its extended thinking research (Anthropic, February 24, 2025), Anthropic put it directly:

“Accuracy on math problems improves logarithmically with the number of ‘thinking tokens’ the model is allowed to sample.”

More compute can help. But the returns diminish fast.
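"Logarithmically" is doing a lot of work in that sentence, so here is what it implies in arithmetic. This is a sketch under loud assumptions: the `BASE` and `GAIN_PER_DOUBLING` coefficients are hypothetical, chosen only to show the shape of a log curve, and are not taken from Anthropic's results.

```python
import math

# Hypothetical coefficients, for illustration only. Not Anthropic's numbers.
BASE = 0.40               # assumed accuracy at a 1,000-token thinking budget
GAIN_PER_DOUBLING = 0.05  # assumed accuracy gained each time the budget doubles


def modeled_accuracy(tokens: int) -> float:
    """Accuracy under an assumed log-linear model of the thinking-token budget."""
    return BASE + GAIN_PER_DOUBLING * math.log2(tokens / 1000)


def gain_per_extra_token(low: int, high: int) -> float:
    """Average accuracy gained per additional token when moving from low to high."""
    return (modeled_accuracy(high) - modeled_accuracy(low)) / (high - low)


for budget in (1_000, 2_000, 4_000, 8_000, 16_000):
    print(f"{budget:>6} tokens  modeled accuracy {modeled_accuracy(budget):.2f}")

print(gain_per_extra_token(1_000, 2_000) / gain_per_extra_token(8_000, 16_000))
```

Under a log curve, every doubling of the budget buys the same bump in accuracy, but each doubling costs twice as many tokens as the one before. In this toy model, the step from 8,000 to 16,000 tokens buys one eighth as much accuracy per extra token as the step from 1,000 to 2,000.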

And token count is not a clean measure of engagement with your problem.

First, a model can produce a long visible chain of reasoning that is still just path one wearing a fake mustache. It looks laborious. It may still be shortcut behavior. The reasoning can be a performance of effort, not proof of it.

Second, longer thinking is not always better. Independent benchmarking of reasoning models has documented cases where they generate roughly eighteen times more tokens than non-reasoning approaches while sometimes producing lower accuracy on simple tasks - what the authors describe as “overthinking” (‘Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models’, Srivastava et al., arXiv). The same way people get worse when they overthink what should have been intuitive.

So yes, compute matters. Yes, tokens are a rough proxy. But no, a long chain-of-thought does not guarantee the model actually engaged your examples instead of decorating a shortcut.

What this actually means

Four practical consequences:

1. Stop accepting the first answer. If the output sounds clean, generic, and suspiciously easy for a problem that should be hard, you probably got path one. Push back. Tell it to use the examples provided, infer the pattern from the data in front of it, and avoid generic heuristics.

2. Examples are an asset. A small library of good examples for recurring tasks - forecasting, contract review, customer triage, deal qualification - often creates more value than another tooling purchase. This is how you make ICL show up on purpose.

3. Vendor demos are tuned. The nice demo is usually held together with careful prompts, curated context, and tasks selected to flatter the system. Out of the box, you get less. If nobody inside your company knows how to force the model onto the better path, the gap between the demo and production will stay embarrassing.

4. The cheap path is not your friend. A shortcut the model takes is usually a shortcut through your differentiation. The whole point of using a frontier model is adaptation to your context. If it keeps falling back to generic patterns, you are paying premium prices for commodity output.
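Points 1 and 2 fit together in practice: keep a small library of vetted examples per recurring task, and reach for it when the first answer comes back generic. A minimal sketch follows, in which `EXAMPLE_LIBRARY`, `pushback_message`, and the sample deals are all hypothetical. The wording of the pushback is one plausible phrasing, not a tested recipe.

```python
# Hypothetical in-house library: a few vetted examples per recurring task.
EXAMPLE_LIBRARY = {
    "deal_qualification": [
        ("Inbound, 40 seats, no budget owner identified",
         "Not qualified: find the budget owner first"),
        ("Renewal, 400 seats, security review already passed",
         "Qualified: route to enterprise AE"),
    ],
}


def pushback_message(task: str) -> str:
    """Build a follow-up to send when the first answer looks generic."""
    examples = EXAMPLE_LIBRARY.get(task)
    if not examples:
        raise KeyError(f"no examples on file for task {task!r}")
    lines = [
        "Your previous answer reads like a generic heuristic.",
        "Redo it using the pattern in these examples, "
        "and say which example drove each decision:",
    ]
    for situation, verdict in examples:
        lines.append(f"- {situation} -> {verdict}")
    return "\n".join(lines)


print(pushback_message("deal_qualification"))
```

The design choice worth noticing is the `KeyError`. If there are no examples on file for a task, the honest outcome is to go write some, not to send a pushback with nothing behind it.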

The thing nobody says out loud

The AI you are paying for has one mode where it actually grapples with your specific problem, and another where it pattern-matches to the laziest plausible answer.

THE LEVER
The default is the lazy one. The lever that flips it is In-Context Learning.

The companies pulling away are not the ones with the biggest contracts. They are the ones that learned how to force the model off the cheap path and onto their actual problem. Everything else is procurement theater.