
The number on the SWE-bench Verified leaderboard tells you almost nothing about whether an agent will solve your customer's problem on a Tuesday. After three years of building agent products and roughly three audits a month, I am done pretending otherwise. The agent eval industry is a confidence game. The benchmarks measure what benchmark authors found easy to grade, every frontier lab optimizes the metric on the way out the door, and your production failure modes are not in the test set.
In late 2024, SWE-bench Verified separated state-of-the-art from everyone else. A 30% solve rate was a headline. By Q2 2026 the bottom of the top tier is above 60%, and the top of the top tier is a blur of identically shaped PRs and near-identical trajectory scaffolds (SWE-bench Verified leaderboard). Tau-bench retail, OSWorld, WebArena, GAIA: same story, six months apart, every release cycle. The score went up. The user-visible reliability of agent products in my customer base did not. If anything, variance increased: the best agents got better, the average agent got less predictable, and the leaderboard averaged both into a number that flatters everyone.
The reason is structural. The benchmarks are public. The agents are trained against them — directly, via distillation, via preference data, via the synthetic-data startups that mine leaderboard trajectories and re-spray them as training signal. When Anthropic, OpenAI, Google, and DeepSeek all train against the same 500 problems in slightly different ways, the leaderboard stops measuring capability and starts measuring how aggressively each lab rephrased the same task.
Most public agent evals score trajectories, not outcomes. Did the agent call the right tool? Did it produce a diff that compiles? Did it reach a terminal state? This is grading the essay by whether the student wrote in complete sentences. A trajectory that loops four times, hallucinates a parameter, scrapes a stale cache, then stumbles into a passing test gets partial credit because the structure was "right." A trajectory that takes one careful pass with no tool calls, no clever scaffolding, and a slightly weird final answer gets zero.
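To make the distinction concrete, here is a minimal sketch of the two grading styles. Everything in it is an assumption for illustration: the `Run` record, the tool names, and the 50/50 rubric are hypothetical and not taken from any real benchmark's grader. Both runs solve the task; only the trajectory rubric tells them apart, and it prefers the messy one.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one agent run (illustrative fields only)."""
    tool_calls: list[str]    # names of tools invoked, in order
    reached_terminal: bool   # did the grader recognise a terminal state?
    task_solved: bool        # did the user's problem actually get fixed?


def trajectory_score(run: Run, expected_tools: set[str]) -> float:
    """Process-shaped grading: half credit for tool coverage, half for terminating."""
    if expected_tools:
        tool_credit = len(expected_tools & set(run.tool_calls)) / len(expected_tools)
    else:
        tool_credit = 0.0
    return 0.5 * tool_credit + 0.5 * float(run.reached_terminal)


def outcome_score(run: Run) -> float:
    """Outcome grading: only the end result counts."""
    return 1.0 if run.task_solved else 0.0


expected = {"search", "edit", "run_tests"}

# Loops, retries, and eventually stumbles into a fix.
loopy = Run(
    tool_calls=["search", "search", "edit", "run_tests", "search", "edit", "run_tests"],
    reached_terminal=True,
    task_solved=True,
)

# One careful pass, no tools, and a final answer the grader does not recognise.
careful = Run(tool_calls=[], reached_terminal=False, task_solved=True)

for name, run in [("loopy", loopy), ("careful", careful)]:
    print(name, trajectory_score(run, expected), outcome_score(run))
```

Under this toy rubric the looping run scores full marks on trajectory and the careful run scores nothing, even though the outcome score is identical for both.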
I have watched the same agent score 78% on a private eval the team swore was "hard," then fail the first twelve production tasks in a row because the eval corpus was scraped from Stack Overflow and the production inputs were support tickets from a SaaS that had just changed its auth flow. The eval measured capability. The production environment measured brittleness. The team's confidence tracked the eval.
The honest objection: without leaderboards we have no signal at all. Coordination breaks. Buyers cannot compare. Researchers cannot publish.
Agreed — and the answer is not better leaderboards. It is leaderboards for the right thing. A useful eval set has three properties: it is private, it is built from your own production traces, and it measures outcomes the user cares about. Time to resolution. Tickets reopened. Refund rate. Cost per successful task. Whether the agent said "I do not know" instead of hallucinating. None of those are on SWE-bench.
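A rough sketch of what an outcome report over production traces could look like. The `TaskRecord` fields and the `outcome_report` function are hypothetical names, and the definition of "successful" (resolved and never reopened) is my assumption; refund rate is left out because it depends on billing data this sketch does not model.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRecord:
    """Hypothetical per-task summary extracted from a production trace."""
    resolved: bool                # ticket was closed by the agent
    reopened: bool                # user reopened it afterwards
    minutes_to_resolution: float  # wall-clock time until close
    cost_usd: float               # model and tool spend for this task
    abstained: bool               # agent said "I do not know"


def outcome_report(records: list[TaskRecord]) -> dict:
    """Aggregate user-facing outcomes; a task counts as successful only if it stays closed."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    successes = [r for r in records if r.resolved and not r.reopened]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "success_rate": len(successes) / n,
        "reopen_rate": sum(r.reopened for r in records) / n,
        "abstention_rate": sum(r.abstained for r in records) / n,
        # Failed tasks still cost money, so divide total spend by successes only.
        "cost_per_successful_task": total_cost / len(successes) if successes else None,
        "median_minutes_to_resolution": (
            median([r.minutes_to_resolution for r in successes]) if successes else None
        ),
    }


records = [
    TaskRecord(True, False, 10.0, 0.50, False),
    TaskRecord(True, True, 5.0, 0.30, False),    # closed, then reopened
    TaskRecord(False, False, 0.0, 0.20, True),   # agent abstained
    TaskRecord(True, False, 20.0, 1.00, False),
]
report = outcome_report(records)
print(report)
```

Dividing total spend by successful tasks, rather than averaging cost per attempt, is a deliberate choice here: it charges every failed or reopened run against the ones that actually worked.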
Stop building strategy around public eval numbers. Stop letting your vendor pick the agent that scored 1.2 points higher on a benchmark the vendor trained against. Build your own eval harness from your own production traces — frozen, versioned, and harder than anything public. Track outcomes, not trajectories. Track cost, not capability.
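One way to make "frozen, versioned" checkable rather than aspirational is to derive the version from the content itself. This is a sketch under assumptions: each case is a JSON-serialisable dict with a unique `id` key, and the helper names `freeze_eval_set` and `verify_eval_set` are mine, not from any existing harness.

```python
import hashlib
import json


def freeze_eval_set(cases: list[dict]) -> dict:
    """Serialise cases canonically and derive a version tag from the content hash."""
    ordered = sorted(cases, key=lambda c: c["id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"version": digest[:12], "n_cases": len(cases), "cases_json": canonical}


def verify_eval_set(frozen: dict) -> bool:
    """Return True only if the stored cases still match the recorded version."""
    digest = hashlib.sha256(frozen["cases_json"].encode("utf-8")).hexdigest()
    return digest[:12] == frozen["version"]


# Hypothetical cases built from production tickets.
cases = [
    {"id": "t-002", "input": "SSO login loops after auth change", "expected": "resolved"},
    {"id": "t-001", "input": "Refund requested twice", "expected": "escalate"},
]

frozen = freeze_eval_set(cases)
print(frozen["version"], frozen["n_cases"], verify_eval_set(frozen))
```

Because the version is a hash of the canonical content, reordering cases leaves it unchanged, while editing a single expected answer produces a new version. Scores reported against different versions are then visibly not comparable.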
The eval industry is not going to fix this. The labs are paid to climb the leaderboard. The benchmark authors are paid to publish benchmarks. The alignment is broken at every layer except the one that matters: the user opening the chat box at 2am and hoping the thing works.
— Mr. Technology
Sources: