ai · 2026-06-13

Microsoft tests its AI graders


Microsoft tests its AI graders — and 80–90% human agreement is the new benchmark

Hey guys, Mr. Technology here — let me break this one down.

What You Need to Know: Microsoft Research published a paper this week on its ASSERT framework — an internal system Microsoft has been using to grade AI-generated code and other AI outputs against human reviewer judgements. The headline number: 80–90% agreement with human graders, which Microsoft is positioning as the new "evaluation" floor for enterprise AI deployments.

Why It Matters

  • "Evals" just became a marketing word. Every model vendor now publishes internal evals. Microsoft's contribution is showing the evaluation pipeline itself: the thing you use to grade the evals. Without a meta-eval layer, "our model scores 95% on X" is hard to trust, because nobody has measured how reliable the grader behind that score is.
  • 80–90% human agreement is the actual number you need to internalize. That's how often Microsoft's LLM-tier grader agreed with its human reviewers across the company's internal datasets. So if your own AI grader agrees with your human reviewers less than 80% of the time, it's below the bar Microsoft is reporting for a production pipeline. That's your floor.
  • The interesting product play is enterprise compliance. Microsoft is positioning ASSERT as a defensible enterprise moat: "we use this to grade our own AI, and now you can too." That moves "evals" from a research artifact to a procurement checkbox, which is where the real money is.

What Actually Happened

The paper, "ASSERT: AI-generated output scoring at scale" (Microsoft Research, June 2026), describes the framework Microsoft has been using internally since late 2025 to grade AI-generated code, summaries, and structured outputs. The core idea is a tiered reviewer system: a fast LLM-based grader first, a stronger LLM for cases where the first one is uncertain, and a human reviewer as the final arbiter on disagreements.
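
To make the tiering concrete, here is a minimal Python sketch of that escalation logic. It is not Microsoft's code: the `Verdict` type, the 0.8 confidence threshold, the two stub graders, and the rule "send it to a human when the two LLM tiers disagree" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    passed: bool        # the grader's pass/fail call
    confidence: float   # self-reported confidence in [0, 1] (assumed to exist)


Grader = Callable[[str], Verdict]


def grade_tiered(
    output: str,
    fast: Grader,
    strong: Grader,
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> Tuple[Optional[bool], str]:
    """Return (verdict, tier). A verdict of None means 'queue for a human'."""
    first = fast(output)
    if first.confidence >= threshold:
        return first.passed, "fast"

    # The fast grader was unsure, so escalate to the stronger grader.
    second = strong(output)
    if second.passed == first.passed:
        return second.passed, "strong"

    # The two LLM tiers disagree: a human reviewer is the final arbiter.
    return None, "human"


# Toy stand-ins for real LLM calls, only so the sketch runs end to end.
def fast_stub(output: str) -> Verdict:
    confident = "ok" in output
    return Verdict(passed="error" not in output,
                   confidence=0.95 if confident else 0.5)


def strong_stub(output: str) -> Verdict:
    clean = "error" not in output and "todo" not in output
    return Verdict(passed=clean, confidence=0.9)


if __name__ == "__main__":
    for sample in ["ok result", "draft result", "draft with todo"]:
        print(sample, "->", grade_tiered(sample, fast_stub, strong_stub))
```

The design point is that the expensive tiers only see what the cheaper tier could not settle, which is what keeps a pipeline like this affordable at production scale.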

The 80–90% figure refers to the agreement rate between the LLM-tier grader and the human reviewer across 12,000+ samples in Microsoft's internal datasets. That's not a model benchmark; it's a measurement of how well the eval pipeline approximates human judgement at production scale.
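
If you want to put the same kind of number on your own grader, the arithmetic is simple. The sketch below computes raw agreement between grader verdicts and human verdicts on the same samples. It also computes Cohen's kappa, a standard chance-corrected companion that I am adding here; the source does not say ASSERT reports it. The ten sample verdicts are made up.

```python
from typing import Sequence


def agreement_rate(grader: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of samples where grader and human gave the same verdict."""
    if not grader or len(grader) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(g == h for g, h in zip(grader, human))
    return matches / len(grader)


def cohens_kappa(grader: Sequence[bool], human: Sequence[bool]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(grader)
    observed = agreement_rate(grader, human)
    expected = 0.0
    for label in (True, False):
        # Chance that both raters pick this label independently.
        expected += (list(grader).count(label) / n) * (list(human).count(label) / n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts: True = pass, False = fail.
    grader = [True] * 8 + [False] * 2
    human = [True] * 7 + [False, False, True]
    print(f"agreement: {agreement_rate(grader, human):.2f}")
    print(f"kappa:     {cohens_kappa(grader, human):.3f}")
```

Raw agreement can look flattering when almost everything passes, which is why a chance-corrected figure is worth keeping next to it.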

Microsoft's framing in the companion blog post is careful: they explicitly say "80–90% is not 100%, and that's the point" — emphasizing that high agreement with humans doesn't mean the AI is correct, just that the AI's errors are predictable enough to route to humans. The remaining 10–20% is the surface area where human-in-the-loop adds value.

The deeper play is product. Microsoft is bundling ASSERT into the Azure AI Foundry stack as a "Continuous Evaluation" feature, with the positioning: "every model you deploy gets an ASSERT score in your dashboard, and you can pin the floor in your procurement contract." That's the first time a major cloud vendor has shipped an "evals SLA" as a thing you can buy.
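
What would "pinning the floor" look like as a check you run yourself? Here is one hedged sketch. It does not use any Azure AI Foundry API; the function names are hypothetical, and the choice to test the lower end of a 95% Wilson score interval (rather than the raw rate) is my assumption, made so that a tiny sample cannot clear the floor by luck.

```python
import math


def wilson_lower_bound(agreements: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an agreement rate."""
    if n <= 0 or not 0 <= agreements <= n:
        raise ValueError("need n > 0 and 0 <= agreements <= n")
    p = agreements / n
    denom = 1.0 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom


def meets_floor(agreements: int, n: int, floor: float = 0.80) -> bool:
    """True when we are reasonably sure the true agreement rate is >= floor."""
    return wilson_lower_bound(agreements, n) >= floor


if __name__ == "__main__":
    # Hypothetical counts: 85% observed agreement at two sample sizes.
    print(meets_floor(10_200, 12_000))  # large sample
    print(meets_floor(17, 20))          # same rate, far too few samples
```

Both cases show 85% observed agreement, but only the large sample gives enough evidence to claim the 80% floor is actually met.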

The Take

The 80–90% human agreement number is the most important data point in enterprise AI right now. If you can't put a number like that on your AI grader — whether it's grading code, grading customer support replies, or grading anything else — you're flying blind. The labs that have a meta-eval layer (Microsoft, Anthropic, Google) are starting to use it as a moat: "we don't just ship a model, we ship a model plus a way to measure how good it is at your task." That's a real procurement argument, and it changes the buying criteria from "which model scores highest on MMLU" to "which vendor will help me measure quality in my specific use case." Microsoft is leaning into this hard. Watch for the same playbook from Anthropic and Google in the next 60 days.

Quick Summary

Microsoft published its ASSERT framework for grading AI output. Headline: 80–90% agreement with human reviewers. The real play is productizing the meta-eval layer as part of Azure AI Foundry. Enterprise AI procurement is shifting from "best model" to "model + measurement."

