ai | 2026-06-02

GPT-5.5 tops the coding leaderboard, 63% of vendors hide their

Datacurve's DeepSWE benchmark crowned GPT-5.5 at 70% and exposed a 32% error rate in SWE-Bench Pro. Separately, MiniMax claims 15.6x faster decoding, and 63% of AI vendors won't tell you which model you're actually buying.

A new coding benchmark put GPT-5.5 on top at 70% and found a roughly 32% grader error rate in SWE-Bench Pro, the most-cited coding leaderboard. Separately, MiniMax claims a 15.6x decoding speedup, and an industry open secret is now getting audited: most vendors won't tell you which model you're actually getting.

What You Need to Know: On May 26, 2026, a startup called Datacurve released DeepSWE, a 113-task coding benchmark that crowned OpenAI's GPT-5.5 the clear leader at 70%, 14 points ahead of GPT-5.4 (56%) and 16 points ahead of Claude Opus 4.7 (54%). An accompanying audit found that SWE-Bench Pro's automated graders reject correct solutions 24% of the time and accept wrong ones 8.5% of the time. Separately, MiniMax claims a 15.6x decoding speedup, and roughly 63% of enterprise AI vendors do not disclose the underlying model on their pricing pages.

Why It Matters

  • For engineering leaders: The "all top models are roughly equal" story was always a benchmark artifact. DeepSWE is a more honest test, and the spread is real.
  • For benchmark researchers: If SWE-Bench Pro's graders are wrong on roughly a third of trials, every model card built on it deserves an asterisk.
  • For procurement teams: "63% of vendors hide the underlying model" is a measured figure, not a hunch, and it is higher than vendor marketing would suggest.
  • For inference engineers: If it holds up, a 15.6x decoding speedup would sharply change cost-per-token economics.

What Actually Happened

DeepSWE crowns GPT-5.5 and exposes SWE-Bench Pro

Datacurve's DeepSWE benchmark, released May 26, 2026, is a 113-task evaluation spanning 91 open-source repositories and five programming languages. Reference solutions average 668 lines added across 7 files — roughly 5.5x the scope of SWE-Bench Pro's typical 120-line tasks. Results: GPT-5.5 leads at 70%, GPT-5.4 at 56%, Claude Opus 4.7 at 54%, Claude Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%. Claude Haiku 4.5 — which scores 39% on SWE-Bench Pro — collapses to zero on DeepSWE, suggesting significant overperformance on easier, possibly contaminated benchmarks. GPT-5.5 reaches 70% at a median cost of $5.80 per trial and 20 minutes of wall-clock time, making it the cost-efficiency frontier. Sources: VentureBeat — DeepSWE blows up the AI coding leaderboard, CodingFleet — SWE-bench Pro Explained, ExplainX — DeepSWE Benchmark Analysis, Vals AI SWE-bench scores.
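As a quick sanity check on the figures in this section, the sketch below recomputes each model's gap to the leader and the task-scope ratio from the numbers reported above. The variable names are illustrative and not part of any Datacurve tooling; the only inputs are the scores and line counts quoted in this article.

```python
# Scores as reported in this article (percent of DeepSWE tasks solved).
deepswe_scores = {
    "GPT-5.5": 70,
    "GPT-5.4": 56,
    "Claude Opus 4.7": 54,
    "Claude Sonnet 4.6": 32,
    "Gemini 3.5 Flash": 28,
    "Claude Haiku 4.5": 0,
}

# Rank models from highest to lowest score.
ranked = sorted(deepswe_scores.items(), key=lambda item: item[1], reverse=True)
leader, leader_score = ranked[0]

# Gap in percentage points between the leader and every other model.
gaps = {model: leader_score - score for model, score in ranked[1:]}

# Average lines added per DeepSWE reference solution, divided by the
# typical SWE-Bench Pro task size quoted above.
scope_ratio = 668 / 120

print(f"Leader: {leader} at {leader_score}%")
for model, gap in gaps.items():
    print(f"  {model}: {gap} points behind")
print(f"Scope ratio: {scope_ratio:.2f}x")
```

Run as written, this puts GPT-5.4 14 points behind the leader and Claude Opus 4.7 16 points behind, and it gives a scope ratio a little above 5.5x.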

SWE-Bench Pro's verifiers are wrong on ~32% of trials

The more damaging finding: Datacurve's audit drew 30 tasks at random from SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and used an LLM judge to independently assess pass/fail. Measured against that judge, SWE-Bench Pro's automated graders accepted wrong implementations 8.5% of the time and rejected correct ones 24% of the time, a combined error rate of roughly 32% (8.5% + 24% = 32.5%). In one documented case, the gold-standard pull request refactored a private helper function; an agent that inlined the same logic (a perfectly valid engineering choice) was marked wrong because the test suite tried to import a symbol that only existed in the original author's implementation. DeepSWE's own verifiers kept both error rates near zero (0.3% false accept, 1.1% false reject). Reference: VentureBeat DeepSWE article, Datacurve blog.
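The private-helper failure is easy to reproduce in miniature. The sketch below is a hypothetical reconstruction, not code from SWE-Bench Pro or Datacurve: the function names (`make_slug`, `_normalize`) and both verifiers are invented for illustration. It shows how a grader that reaches for a private symbol rejects a behaviorally identical solution, while a grader that only exercises the public function accepts it.

```python
import types

# Hypothetical "gold" patch: the public function delegates to a private helper.
GOLD_SOURCE = '''
def _normalize(name):
    return name.strip().lower()

def make_slug(name):
    return _normalize(name).replace(" ", "-")
'''

# Hypothetical agent patch: same behavior, with the helper logic inlined.
AGENT_SOURCE = '''
def make_slug(name):
    return name.strip().lower().replace(" ", "-")
'''


def load_module(name, source):
    """Build an in-memory module from source text so the example is one file."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


def brittle_verifier(module):
    """Mimics a test that imports the private helper before checking behavior."""
    if not hasattr(module, "_normalize"):
        return False  # the equivalent of an ImportError in a real test suite
    return module.make_slug("  Hello World ") == "hello-world"


def behavioral_verifier(module):
    """Checks only the public contract, ignoring how it is implemented."""
    return module.make_slug("  Hello World ") == "hello-world"


gold = load_module("gold", GOLD_SOURCE)
agent = load_module("agent", AGENT_SOURCE)

for label, module in (("gold", gold), ("agent", agent)):
    print(label, "brittle:", brittle_verifier(module),
          "behavioral:", behavioral_verifier(module))
```

The agent's patch fails only the brittle verifier, which is the false-reject pattern the audit describes: the test is coupled to one author's internal structure rather than to the required behavior.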

MiniMax claims 15.6x faster decoding

In a separate piece of the AGI Weekly package, MiniMax claims a 15.6x speedup in inference decoding on their latest model release. The company hasn't published the full benchmark methodology as of the TLDR post date, but the claim is consistent with a wider industry push toward speculative decoding, paged attention variants, and tighter CUDA kernels. If the number holds under third-party testing, it materially changes the unit economics of long-context agents. Reference: TLDR's AGI Weekly coverage in VentureBeat.
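How much a decoding speedup matters end to end depends on how much of a request's wall-clock time is actually spent decoding. A rough way to reason about that is Amdahl's law, sketched below. This is a back-of-the-envelope model, not MiniMax's methodology (which is unpublished): the decode-time shares are hypothetical, and the model assumes everything other than decoding (prefill, tool calls, network) is unchanged.

```python
def end_to_end_speedup(decode_fraction, decode_speedup=15.6):
    """Amdahl's law: overall speedup when only the decode phase gets faster.

    decode_fraction is the share of total wall-clock time spent decoding
    before the optimization (a hypothetical input, between 0 and 1).
    decode_speedup is the claimed factor for the decode phase alone.
    """
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    remaining = (1.0 - decode_fraction) + decode_fraction / decode_speedup
    return 1.0 / remaining


# Hypothetical workloads: the larger the decode share, the closer the
# end-to-end gain gets to the headline 15.6x.
for share in (0.5, 0.8, 0.95, 1.0):
    print(f"decode share {share:.0%}: {end_to_end_speedup(share):.2f}x overall")
```

Under these assumptions, a workload that spends half its time decoding gets less than a 2x overall gain from a 15.6x decode speedup, so the headline number only translates directly into unit economics for decode-dominated workloads.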

63% of AI vendors hide the underlying model

A separate line of research, surfaced through Netguru and Pertama Partners' vendor evaluation work and amplified by Forrester analyst Michael Forrester, finds that 63% of enterprise AI vendors don't disclose the underlying foundation model on their pricing pages or in standard contract terms. A parallel finding, the 92% figure on vendor data-usage-rights claims, is the inverse problem: vendors over-claiming their data rights rather than under-disclosing their model. For buyers, the practical impact is that you can't honestly compare two AI SaaS products if you don't know whether one is running on GPT-5.5 and the other on a fine-tuned Llama 4. Reference: Michaelrishiforrester.com (vendor analysis), Bosio Digital (OpenAI vs Anthropic vs Google AI Agents).


The Take

The DeepSWE audit is the most important AI benchmark story of 2026 so far, and I don't think the industry has absorbed it yet. SWE-Bench Pro is the benchmark every Fortune 500 procurement team is using to justify six- and seven-figure AI coding contracts. If its verifiers are wrong a third of the time, those contracts are built on sand. GPT-5.5 vs Claude Opus 4.7 is the easy headline, but the harder story is that the test was lying. For MiniMax's 15.6x claim and the 63% vendor-hiding stat, the same rule applies: don't take the headline number at face value. Ask which benchmark, which verifier, and which model is actually answering. The buyers who do that are going to be the ones who still have jobs in 2027.

Quick Summary

GPT-5.5 leads DeepSWE at 70%, and the accompanying audit exposes a roughly 32% grader error rate in SWE-Bench Pro. MiniMax claims a 15.6x decoding speedup, and 63% of AI vendors won't tell you which model you're actually buying.


Source: VentureBeat | mr.technology — The Master Skill Index