Let me tell you something the industry knows and won't say: AI benchmarks are theater. Not fraud, theater. The difference matters. Fraud is malicious. Theater is just... performance. And the performance has gotten very elaborate.
OpenAI releases a model. It tops MMLU. Anthropic releases a model. It tops MMLU. Google releases a model. It tops MMLU. The same benchmark, the same leaderboard position, year after year, for models that perform dramatically differently in production. And nobody seems to notice that the numbers are useless for the decisions they're being used to make.
I want to be precise about what I mean. I'm not saying benchmarks are fake. I'm saying they're measuring the wrong things, in the wrong conditions, for the wrong purposes, and the entire industry has quietly agreed to pretend otherwise because the alternative, admitting we don't know how to evaluate models, is too uncomfortable.
Let's talk about SWE-Bench. It tests whether a model can resolve real GitHub issues. It's one of the more respected software engineering benchmarks. A model scores 80% on SWE-Bench. What does that tell you about how it will perform on your codebase?
Nothing. Nothing useful.
Your codebase isn't in SWE-Bench. Your code patterns, your abstractions, your naming conventions, your test philosophy: none of that appears in the benchmark. A model that scores 80% on SWE-Bench might score 40% on your specific task because your code uses unusual patterns the model hasn't seen in training. Or it might score 95% because your style happens to match the dominant representation in the training data.
The benchmark tells you about the benchmark. Not about your task. This is not a controversial claim. It's true of every standardized test in every domain. But AI marketing has somehow convinced everyone that model comparison is simple because benchmark numbers are precise.
Here's what nobody talks about: benchmark design is a political act.
Who picks the tasks? Who defines what a correct answer looks like? Who decides which edge cases to include and which to exclude? Every benchmark embodies choices about what's important. Those choices reflect the values and priorities of the people who built it. When a model beats a benchmark, it's because it was aligned with those choices, not because it's universally better.
MMLU was designed before the current generation of models existed. GPT-3 scored about 44% on it in 2020. By 2024, frontier models were reporting scores around 90%. Does that mean models got roughly 45 percentage points better at general knowledge in four years? Or does it mean the benchmark got easier to game as models were explicitly trained to target it?
Both. That's the honest answer. And it means the benchmark is no longer a measurement. It's a target. And when you optimize for a target long enough, the target stops measuring what you originally cared about.
This is not theoretical. It's documented.
The paper that dropped this week about invisible orchestrator patterns in multi-agent systems showed something important: output quality looked perfect across all conditions. Every system passed. The behavioral contamination, the orchestrator dissociation, the worker contamination — all invisible to the evaluation metrics. The benchmark showed 100% accuracy while the internal state of the system was actively falling apart.
That's the benchmark problem in miniature. We measure what we can measure. The things that actually determine whether a system is safe, reliable, and appropriate for production are exactly the things benchmarks miss.
Another example: model companies know which benchmarks are public. They optimize for public benchmarks. Private benchmarks, which vendors use internally and which enterprise buyers use to make actual purchasing decisions, show dramatically different results than public ones. I've talked to ML engineers at three different AI companies who confirmed this quietly and then asked not to be quoted because the implications are uncomfortable for their employers.
Here's what teams actually do when making model decisions: they run private evaluations. They take a sample of their own data, their own tasks, their own edge cases, and they run it against the models they're considering. They care far less about MMLU or HumanEval than they do about this internal evaluation.
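To make that concrete, here is a minimal sketch of what a private evaluation loop can look like. Everything in it is an assumption for illustration: the `EvalCase` fields, the `exact_match` grader, and the two stand-in "models" are hypothetical placeholders for your own data, your own grading rule, and whatever client you actually use to call a model. The one idea worth keeping is that results are reported per tag, so edge cases don't disappear into a single headline number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str    # an input drawn from your own workload
    expected: str  # the answer you would accept as correct
    tag: str       # hypothetical slice label, e.g. "typical" or "edge_case"


def exact_match(output: str, expected: str) -> bool:
    """Deliberately simple grader; real suites usually need task-specific checks."""
    return output.strip().lower() == expected.strip().lower()


def run_private_eval(
    model: Callable[[str], str],
    cases: List[EvalCase],
    grade: Callable[[str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Return the pass rate per tag for one model."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for case in cases:
        totals[case.tag] = totals.get(case.tag, 0) + 1
        if grade(model(case.prompt), case.expected):
            passed[case.tag] = passed.get(case.tag, 0) + 1
    return {tag: passed.get(tag, 0) / n for tag, n in totals.items()}


# Stand-in models. In practice each would wrap a call to a real model.
def model_a(prompt: str) -> str:
    return "42" if "answer" in prompt else "unknown"


def model_b(prompt: str) -> str:
    return "42"


cases = [
    EvalCase("What is the answer?", "42", "typical"),
    EvalCase("Give the answer again.", "42", "typical"),
    EvalCase("Edge: empty input", "unknown", "edge_case"),
]

models = {"model_a": model_a, "model_b": model_b}
results = {name: run_private_eval(fn, cases) for name, fn in models.items()}

for name, scores in results.items():
    print(name, scores)
```

In this toy run both stand-ins look identical on the "typical" slice and diverge completely on the edge case, which is the kind of difference a single leaderboard number would never show you.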
This is the right approach. It's also entirely at odds with how the industry talks about benchmarks publicly. The press releases go out with benchmark numbers. The purchasing decisions get made on private evals. The gap between the two is where the theater lives.
I'm not saying benchmarks are useless. I'm saying they're being used for purposes they weren't designed for and can't serve.
Benchmarks are useful for tracking progress within a research tradition: showing that approach A is better than approach B on task X under conditions Y and Z. They're not useful for predicting production performance, evaluating safety properties, or making model selection decisions for real-world deployments.
The teams that understand this are running private evaluation suites, spending real engineering time on eval design, and treating benchmark scores as one data point among many. The teams that don't, the ones making decisions from headlines about who topped which leaderboard, are making expensive mistakes that won't show up in the benchmark numbers until it's too late.
If you're building production AI systems and making decisions based on benchmark leaderboards, you're not evaluating. You're performing. And the audience for that performance is you.
*No benchmark was harmed in the writing of this post. But several purchasing decisions probably should have been.*