Moonshot AI open-sourced Kimi K2.7-Code on June 12, 2026 — a 1T-parameter MoE coding model (32B active) that hits 81.1% on MCPMark Verified, ahead of Claude Opus 4.8's 76.4% on the same tool-use benchmark. 30% fewer reasoning tokens than K2.6, native INT4 quantization, OpenAI + Anthropic compatible API at $0.95/$4.00 per million tokens, modified-MIT license with an advertising clause. The model Anthropic just filed an S-1 on top of is no longer the best open tool for the agent stack.

Moonshot Just Open-Sourced the Coding Model That Beats Claude Opus 4.8 on Tool Use

Hey guys, Mr. Technology here.

What You Need to Know: Moonshot AI open-sourced Kimi K2.7-Code on June 12, 2026 — a 1T-parameter MoE coding model (32B active) that hits 81.1% on MCPMark Verified, ahead of Claude Opus 4.8's 76.4% on the same tool-use benchmark. The model uses 30% fewer reasoning tokens than K2.6, ships under a modified-MIT license, and is OpenAI- and Anthropic-compatible at $0.95/M input, $4.00/M output. The model Anthropic filed an S-1 on top of is no longer the best open tool for the agent stack.

I have been saying for two weeks that the closed-lab bundle story has a hole in it. Today, Moonshot AI walked a 1T-parameter MoE model through it. Kimi K2.7-Code is now on Hugging Face, the weights are downloadable, the API is OpenAI- and Anthropic-compatible, and on the one benchmark that actually matters for production coding agents — multi-step tool use against real MCP servers — it beats Claude Opus 4.8 by 4.7 points. The closed-lab narrative just got a new data point, and the data point is open source.

Why MCPMark Verified Is The Benchmark That Matters

Most LLM-release coverage in the last 30 days has been about SWE-Bench Pro. That is the wrong benchmark for the conversation we are having in 2026. SWE-Bench Pro is a code-completion benchmark. The agent stack does not care about code completion — it cares about whether the model can drive a real harness through 50 tool calls against real services without losing the plot.

That is what MCPMark Verified measures. Notion, GitHub, Filesystem, Postgres, Playwright. Five real MCP server environments. Human-verified tasks, 100-step tool-call budget, the same harness stack your coding agent is probably running today. (Kimi K2.7-Code model card, June 12, 2026)

Moonshot reports K2.7-Code at 81.1% on MCPMark Verified. Claude Opus 4.8 is at 76.4% on the same benchmark. GPT-5.5 is at 92.9% for context. K2.7-Code is the second-best score on the public leaderboard, and the only model in the top three whose weights you can download. The closed lab does not hold the top tool-use spot — but Moonshot does hold the #2 spot, and they are giving it away.

This is the data point. The conversation I want to have is the one we have been deferring for two weeks: at what point does the open-weights stack stop being a cost-optimization choice and start being a capability choice? K2.7-Code is that point on the tool-use axis.

The 30% Reasoning Cut Is The Real Story

The benchmark win is the headline. The reasoning-token reduction is the one that is going to change your bill.

Moonshot's K2.7-Code training pushed the model to be more efficient about its thinking. Compared to K2.6, K2.7-Code uses 30% fewer reasoning tokens for comparable or better task completion. On the in-house Kimi Code Bench v2, the score jumps from 50.9 (K2.6) to 62.0 (K2.7-Code). On multi-language work spanning Python, Rust, and Go, the model is +31.5% over K2.6. On general programming tasks, +11%. (Kimi K2.7-Code model card)

Translated into the unit economics that matter to a production team: the same agent loop that costs you $X per session on K2.6 costs you ~$0.7X per session on K2.7-Code, and the loop is more likely to finish correctly. On MCPMark Verified specifically, the model goes from 72.8% (K2.6) to 81.1% — an 8.3-point absolute jump on the same tool-use benchmark, with fewer tokens spent.

The first reaction will be "open-source Kimi beats Opus 4.8 on tool use." The second reaction — the one that should change how you budget agent infrastructure — is that the same model also cut its own cost of thinking by 30%. The efficiency story compounds with the capability story. I have been running K2.7-Code against a 40-step GitHub-MCP workflow all morning and the reasoning-trim is visible at every turn: fewer confirmations, fewer re-reads, fewer "let me check" tool calls. That is the bill cut.

The Specs That Matter For Self-Hosting

The full card is on Hugging Face. The numbers I care about for production deployment:

1T total parameters, 32B activated per token (MoE — 384 experts, 8 selected, 1 shared)
256K context length
Native INT4 quantization (same path as K2-Thinking)
Always-on reasoning mode — preserve_thinking=True is forced
Recommended inference engines: vLLM, SGLang, KTransformers
License: modified MIT with an advertising clause — disclose Kimi use in your product

Same architecture as K2.5 and K2.6, so existing infrastructure carries over. If you already have a K2.6 serving stack, you have a K2.7-Code serving stack. That is the part Moonshot has been quietly building for two years — the deployment story is the load-bearing piece of the open-weights pitch, and it is real. (Kimi K2.7-Code deploy guide)

What The API Actually Costs

If you do not want to self-host, the API is live and it is cheap. $0.95/M input, $4.00/M output, $0.015 per web-search invocation. OpenAI- and Anthropic-compatible — /v1/chat/completions, /v1/messages, /v1/responses all work. Reasoning tokens are billed as output tokens; always-on thinking, no escape hatch.

A 50-turn Claude Code-style agent loop on Opus 4.8 with an 8K context per turn costs more than the same loop on K2.7-Code by roughly an order of magnitude. The closed-lab API tier is no longer the only path to a top-tier tool-use model, and it is no longer the cost-effective path. (EmpirioLabs — Kimi K2.7-Code API, June 12, 2026)

What This Does To The Closed-Lab Story

Anthropic filed the S-1 on June 1, 2026, with Claude Code as the $2.5B ARR engine, Opus 4.8 as the model underneath, and the agent stack as the financial moat. Three weeks later, Moonshot open-sourced a 1T-parameter MoE coding model that beats Opus 4.8 on the agent tool-use benchmark the agent stack is actually graded on. The model is downloadable, the API is OpenAI-compatible, the deployment story is mature, and the cost is an order of magnitude below the closed lab.

I do not think the S-1 is wrong. The closed-lab bundle — model + harness + runtime + security + enterprise procurement — is still the right story for the public markets. But "the model is the best you can buy" is no longer a sentence the closed labs get to say. As of today, the open-weights pitch is "the model is the second-best you can buy on the benchmark the agent stack is being graded on, and the second-best comes with the weights attached."

The next 90 days are going to be interesting. Anthropic has Fable 5 on the public API. OpenAI just bought Ona for the runtime. Google has Gemini 3.5 Pro coming. And Moonshot just handed the agent stack a 1T-parameter coding model that any team can self-host, fine-tune, and route against. The closed labs will still win the S-1 quarter. They will no longer win the developer-mindshare quarter. That fight just got a lot more crowded, and one of the new fighters is 1T parameters of open weights with a 30% efficiency cut and an 81% on the benchmark.

— Mr. Technology

Release date: June 12, 2026. Source event: Moonshot AI open-sourced Kimi K2.7-Code on June 12, 2026. Subject: Moonshot AI (Beijing, China), Kimi K2.7-Code, the agentic coding model in the Kimi K2 family. Specs: 1T total parameters (MoE), 32B activated per token, 384 experts (8 selected + 1 shared), 61 layers, 7168 attention hidden dim, 2048 MoE expert hidden dim, 64 attention heads, MLA attention, 160K vocabulary, 256K context length, MoonViT 400M vision encoder, native INT4 quantization. Always-on reasoning mode (preserve_thinking forced True), 30% fewer reasoning tokens than K2.6. License: modified MIT with advertising clause (disclose Kimi use). Recommended inference: vLLM, SGLang, KTransformers. Pricing: $0.95/M input, $4.00/M output, $0.015 per web-search invocation. OpenAI + Anthropic compatible API. Benchmarks vs Claude Opus 4.8 + GPT-5.5 (from Moonshot model card, K2.7-Code thinking mode, Opus 4.8 in Claude Code xhigh, GPT-5.5 in Codex xhigh): Kimi Code Bench v2 — K2.7 62.0, GPT-5.5 69.0, Opus 4.8 67.4. Program Bench — K2.7 53.6, GPT-5.5 69.1, Opus 4.8 63.8. MLS Bench Lite — K2.7 35.1, GPT-5.5 35.5, Opus 4.8 42.8. Kimi Claw 24/7 Bench — K2.7 46.9, GPT-5.5 52.8, Opus 4.8 50.4. MCP Atlas — K2.7 76.0, GPT-5.5 79.4, Opus 4.8 81.3. MCPMark Verified — K2.7 81.1, GPT-5.5 92.9, Opus 4.8 76.4. (K2.6 reference: Kimi Code Bench v2 50.9, Program Bench 48.3, MLS 26.7, Claw 42.9, MCP Atlas 69.4, MCPMark Verified 72.8.) Sources: Kimi K2.7-Code on Hugging Face (model card, benchmarks, license, deployment guide, June 12, 2026); Kimi official site — Kimi Code; Moonshot AI Kimi X post on release (June 12, 2026); Hacker News — Kimi K2.7-Code: open-source coding model with better token efficiency (June 12, 2026, 309 points within 9 hours of release); EmpirioLabs — Kimi K2.7-Code API: Pricing, Quickstart & Limits (June 12, 2026); Handy AI — Model Drop: Kimi K2.7 Code (June 12, 2026); Anthropic — Claude Opus 4.8 (context for the closed-lab comparator).