Opinion · 2026-06-16

Reasoning Models Are Not Reasoning. The Benchmarks Prove It.

o3, R1, Claude with extended thinking — the 'reasoning' category is test-time search dressed up as a new cognitive primitive. The labs are not lying. They are letting you lie to yourself.

The thing the industry calls a "reasoning model" — o3, DeepSeek-R1, Claude with extended thinking, the whole category — is not a new cognitive primitive. It is test-time compute search wearing a lab coat. The labs are not lying. They are letting you lie to yourself, because the marketing is more valuable than the truth.

The Trick Is Compute, Not Cognition

Mechanically, "reasoning" is a model sampling many candidate chains of thought, scoring them with a verifier, and showing you the best-looking one. The o-series is trained to do this with RL on chain-of-thought trajectories. R1 is trained the same way. Claude's extended thinking just exposes the chain-of-thought the model would have run anyway and lets you budget tokens for it. The same model can be made to "reason" or not by varying the inference budget. The reasoning is not in the model. The reasoning is in the compute you are renting.
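The sample, score, select loop is small enough to sketch. What follows is a toy illustration of best-of-N selection, not any lab's actual pipeline: `best_of_n`, `toy_sampler`, and `toy_verifier` are hypothetical names, and the "model" is a noisy guesser at 17 * 23 rather than a language model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    sample: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates: List[str] = [sample(rng) for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_sampler(rng: random.Random) -> str:
    """Hypothetical stand-in for a model: guesses 17 * 23 with random error."""
    return str(391 + rng.choice([-20, -10, -1, 0, 0, 1, 10]))


def toy_verifier(candidate: str) -> float:
    """Hypothetical verifier: higher is better, 0 means exactly right."""
    return -abs(int(candidate) - 391)


if __name__ == "__main__":
    for budget in (1, 4, 16, 64):
        answer, quality = best_of_n(toy_sampler, toy_verifier, n=budget, seed=7)
        print(f"n={budget:>2}  answer={answer}  verifier_score={quality}")
```

The sampler never changes between runs. With a fixed seed, a larger `n` can only match or improve the selected score, because the smaller run's candidates are a prefix of the larger run's. The only knob is the budget.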

This is not new. Search with a learned heuristic is the entire history of game-playing AI — AlphaGo in 2016, Stockfish with a neural eval today. What the marketing hides is that search over a discrete space of natural-language chains does not produce a general reasoner.

The Benchmarks Are the Proof

The benchmarks these models top have leaked into their own training data. AIME, MATH, GPQA Diamond, FrontierMath, SWE-Bench Verified: nearly every "reasoning" benchmark the labs publish draws on material that has been on arXiv, GitHub, AoPS, and Reddit for years. The labs claim a held-out internal eval. They do not publish the contamination audit. They publish the score. The score is the marketing.
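For a sense of what a contamination audit involves, one common approach is n-gram overlap between benchmark items and training text. The sketch below is a rough illustration under loud assumptions: whitespace tokenization, a corpus small enough to hold in memory, and an 8-token window picked for illustration. `ngrams` and `overlap_fraction` are hypothetical helper names, not any lab's tooling.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the corpus.

    Returns 0.0 when the item is shorter than n tokens.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high fraction flags an item for manual review. A low one proves little, since paraphrased or translated copies slip straight through exact matching.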

The generalization gap is enormous. When ARC-AGI shipped a private held-out set, the "reasoning" models that topped every public leaderboard dropped into the teens and single digits. Models that "reason" at 95% on AIME cannot break 25% on novel visual-abstract problems a smart ten-year-old solves in two minutes. If a model is reasoning, it generalizes. If it is searching, it memorizes.

The chain-of-thought is a search artifact. Anthropic's interpretability work, and follow-on papers this year, show the chains these models produce are post-hoc rationalizations in a meaningful fraction of cases. The model picks the answer first, then generates a plausible justification. This is what the verifier rewards.

What This Means For Agents

Do not pay the reasoning-model tax on every step of your agent loop. A 200-step run using o3 for every tool call pays 10x to 50x the cost for a marginal bump on easy steps and almost no gain on hard steps where the search cannot find a known-good trajectory. Use the cheap model for routing, formatting, retries, and tool selection. Use the reasoning model for the one or two steps per run where the problem is actually novel.
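The arithmetic is worth doing once. A back-of-the-envelope sketch, assuming every step costs the same on the cheap model and the reasoning model is a flat multiple of that; `run_cost` and its parameters are hypothetical, and real pricing varies per token rather than per step.

```python
from typing import Tuple


def run_cost(
    steps: int,
    hard_steps: int,
    cheap_cost: float,
    reasoning_multiplier: float,
) -> Tuple[float, float]:
    """Return (all_reasoning_cost, routed_cost) for one agent run.

    all_reasoning: every step goes to the reasoning model.
    routed: only hard_steps go to the reasoning model, the rest stay cheap.
    """
    reasoning_cost = cheap_cost * reasoning_multiplier
    all_reasoning = steps * reasoning_cost
    routed = (steps - hard_steps) * cheap_cost + hard_steps * reasoning_cost
    return all_reasoning, routed


if __name__ == "__main__":
    for multiplier in (10, 50):
        everywhere, routed = run_cost(200, 2, 1.0, multiplier)
        print(f"{multiplier}x: everywhere={everywhere} routed={routed} "
              f"ratio={everywhere / routed:.1f}")
```

For the 200-step run with two hard steps, routing costs 218 cheap-step units against 2,000 at a 10x multiplier, and 298 against 10,000 at 50x. That is roughly 9x and 34x saved under these assumptions.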

Treat the chain-of-thought as a debugging artifact, not a justification. If your agent shows you reasoning and then does the wrong thing, the reasoning was theater. Read the output. Read the tool calls. Do not trust the prose in between. It does not reliably represent the model's actual decision process.

Stop planning agent architectures around "reasoning." If the design doc says "the planner uses o3 to reason about the task," ask: what is the verifier the search is optimizing against? If the answer is "the model's own confidence," what you have is confident pattern matching. A real product needs a real external grader.
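The difference between the two answers fits in a few lines. This is a minimal sketch with hypothetical names (`Proposal`, `pick_by_confidence`, `pick_by_grader`); the grader here is a trivial string check standing in for whatever deterministic test your product can actually run, such as a unit test suite or a schema validator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Proposal:
    answer: str
    self_confidence: float  # model-reported, not externally checked


def pick_by_confidence(proposals: List[Proposal]) -> Proposal:
    """Select on the model's own confidence: no external signal involved."""
    return max(proposals, key=lambda p: p.self_confidence)


def pick_by_grader(
    proposals: List[Proposal],
    grader: Callable[[str], bool],
) -> Optional[Proposal]:
    """Return the first proposal the external grader accepts, else None."""
    for proposal in proposals:
        if grader(proposal.answer):
            return proposal
    return None


# Hypothetical task: return the list [3, 1, 2] sorted ascending.
proposals = [
    Proposal(answer="3,2,1", self_confidence=0.99),
    Proposal(answer="1,2,3", self_confidence=0.40),
]


def external_grader(answer: str) -> bool:
    """Deterministic check that does not consult the model at all."""
    return answer == "1,2,3"
```

Here `pick_by_confidence` returns the wrong answer with the highest confidence, while `pick_by_grader` returns the correct one or admits failure with `None`. The second behavior is the one you can build a product on.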

Watch for the next plateau. The "reasoning" category is about to hit the same wall LLMs hit in late 2024, when benchmark gains flattened and the labs pivoted to tool use. Build for the post-plateau primitive, not the pre-plateau one.

The Take

Reasoning models are a real engineering tool. Test-time compute search is useful for math, code synthesis on known patterns, and problems where the verifier is a deterministic grader. It is not a step toward AGI. It is a more expensive way to do something the field has known how to do for a decade.

The industry is selling you a paradigm shift. What it is shipping is a pricing model.

If you are building agents in production, build for the search. Do not build for the reasoning. The search has a budget. The reasoning is a logo on the box.

Mr. Technology
