Every few weeks, a model drops with a breathless press release and a benchmark table that conveniently omits the comparison that matters. DeepSeek V4 Pro is not that. It's better, because DeepSeek shipped the model, set a price that makes you do a double-take, and let the numbers speak for themselves.
DeepSeek V4 Pro hit stable public API on May 12, 2026 (yes, this week) via Alibaba Cloud and Venice AI, with DeepSeek's own API following close behind. The pricing is not a typo: .40 per million input tokens, .80 per million output. Let that sink in. GPT-5.5 Pro is 0/80 at the same context length. Opus 4.7 is 5/5. DeepSeek V4 Pro is delivering comparable performance at 8-20x lower cost. That's not a price war. That's an extinction event for margin-padding.
DeepSeek V4 Pro is the full weights release of DeepSeek's V4 series. The earlier releases were API-only or preview tiers. This is the open-weights drop — the one you can download, quantize to GGUF, and run on hardware you own.
On the benchmarks that matter and are hardest to cherry-pick: MMLU 87.8%, MMLU-Pro 65.5%, GSM8K 91.1%, HumanEval Pass@1 69.5%. These numbers put it in the same tier as GPT-5.5 and Claude Opus 4.7 on standard academic benchmarks. The agentic benchmark numbers — Terminal-Bench 2.0, tool-calling accuracy, schema adherence — are where it apparently punches above its weight class, matching or exceeding Opus 4.7 on some agentic tasks according to developer reports from the past few days.
The 1 million token context window is real, not interpolated. And unlike some models that advertise long context but degrade sharply past 128k, DeepSeek V4 Pro maintains coherent reasoning through the full context in developer tests. That's partly architecture, partly training data curation, partly the fact that DeepSeek has been shipping long-context models since V2 and has had time to work out the bugs.
Let's talk about what open-weights actually means in 2026, because the definition has shifted.
DeepSeek V4 Pro weights are available. You can download them. You can run them on your own hardware. You can quantize them — Q4_K_M on an M2 MacBook Pro gives you roughly 40 tokens/second on the 8B variant, which is usable. The 70B variant at Q4 needs a serious GPU rig but is well within the reach of anyone running a homelab or a small cloud instance.
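For a rough sense of whether a quantized variant fits on your machine, a back-of-the-envelope estimate helps. The sketch below is not an official sizing tool. It assumes Q4_K_M averages roughly 4.85 bits per weight (an approximation commonly quoted for llama.cpp quants, and a parameter you should adjust), and it counts weights only, ignoring KV cache and runtime overhead. The function name `weight_gb` is hypothetical.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    Weights only: KV cache, activations, and runtime overhead are NOT included,
    so treat the result as a lower bound on the memory you need.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    # billions of params * bits per param / 8 bits per byte = billions of bytes (GB)
    return params_billions * bits_per_weight / 8


# ASSUMPTION: ~4.85 bits/weight is an approximate average for Q4_K_M,
# not a figure taken from the DeepSeek release.
Q4_K_M_BITS = 4.85

if __name__ == "__main__":
    for size in (8, 70):  # the variant sizes mentioned in the text above
        print(f"{size}B at Q4_K_M: about {weight_gb(size, Q4_K_M_BITS):.1f} GB of weights")
```

Long contexts add a KV cache on top of this, so leave real headroom before deciding a variant "fits".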
The cost comparison that actually matters for production systems: running DeepSeek V4 Pro on your own hardware at scale costs you your electricity bill. The API option via Venice AI or Alibaba Cloud at .73/.80 per million tokens is already so cheap that the economics of building a production pipeline around it are favorable versus any proprietary model at comparable quality.
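If you want to run this comparison against your own traffic, the arithmetic is simple enough to script. Here is a minimal sketch: the `Pricing` class and function names are hypothetical, and the prices in the usage lines are placeholders, not quotes from any provider. Plug in the current rate card yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD for one model/provider."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request with the given token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(pricing: Pricing, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """USD cost of a steady workload where every request has the same shape."""
    return request_cost(pricing, input_tokens, output_tokens) * requests_per_day * days


def cost_ratio(expensive: Pricing, cheap: Pricing,
               input_tokens: int, output_tokens: int) -> float:
    """How many times more the 'expensive' option costs for the same request."""
    return (request_cost(expensive, input_tokens, output_tokens)
            / request_cost(cheap, input_tokens, output_tokens))


if __name__ == "__main__":
    # PLACEHOLDER prices for illustration only; not real provider pricing.
    model_a = Pricing(input_per_million=10.0, output_per_million=20.0)
    model_b = Pricing(input_per_million=1.0, output_per_million=2.0)
    print(monthly_cost(model_b, requests_per_day=10_000,
                       input_tokens=4_000, output_tokens=500))
    print(cost_ratio(model_a, model_b, input_tokens=4_000, output_tokens=500))
```

The ratio depends on your input/output mix, which is why a single "Nx cheaper" headline number rarely matches what you see on an invoice.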
This is the open-source model that actually competes on the dimension that matters: real-world cost per useful output. Not benchmark cost, not synthetic cost — the actual cost of getting a task done in production.
The most practically significant detail from the May 12 release is the tool-use improvement. Developer reports from the past 72 hours consistently note better schema adherence, fewer malformed JSON responses, and more reliable function-calling behavior than earlier DeepSeek models.
This matters because the agentic loop is where most production LLM deployments actually live, and that's also where models tend to fail in subtle, expensive ways. A model that hallucinates a JSON key is a minor annoyance in chat. It's a pipeline failure in an agentic system. If DeepSeek V4 Pro has genuinely closed the gap with Opus 4.7 on tool-use reliability, that's a significant production signal, not just a benchmark talking point.
The strict JSON enforcement is also worth noting: the model no longer wraps its response in conversational scaffolding. You get the JSON you asked for, or you get an error. That's the right trade-off for agentic systems, and it's the trade-off most developers actually want.
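Whatever the model promises, an agentic pipeline should still enforce the same contract on its side. The sketch below shows one way to do that with only the standard library. It is not DeepSeek's enforcement mechanism, and `parse_tool_call`, `ToolCallError`, and the example schema are hypothetical names made up for illustration.

```python
import json
from typing import Any


class ToolCallError(ValueError):
    """Raised when model output is not exactly the JSON object we asked for."""


def parse_tool_call(raw: str, required: dict[str, type]) -> dict[str, Any]:
    """Parse model output strictly: valid JSON, exact keys, expected types."""
    try:
        # json.loads rejects leading chatter and trailing text,
        # so conversational scaffolding fails here.
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise ToolCallError("expected a JSON object at the top level")

    missing = [key for key in required if key not in payload]
    if missing:
        raise ToolCallError(f"missing keys: {missing}")

    # A hallucinated key is treated as a hard failure, not silently dropped.
    unexpected = [key for key in payload if key not in required]
    if unexpected:
        raise ToolCallError(f"unexpected keys: {unexpected}")

    for key, expected_type in required.items():
        if not isinstance(payload[key], expected_type):
            raise ToolCallError(f"key {key!r} should be {expected_type.__name__}")

    return payload


if __name__ == "__main__":
    schema = {"city": str, "days": int}  # hypothetical tool arguments
    print(parse_tool_call('{"city": "Oslo", "days": 3}', schema))
    try:
        parse_tool_call('Sure! Here you go: {"city": "Oslo", "days": 3}', schema)
    except ToolCallError as err:
        print("rejected:", err)
```

Failing loudly at the boundary keeps a bad tool call from propagating three steps down the loop before anything notices. For richer schemas, a dedicated JSON Schema validator is the better tool than hand-rolled type checks.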
GPT-6 got more press. Claude Mythos got more Twitter threads. Both are real releases. Neither of them undercuts GPT-5.5 pricing by 10x while matching its benchmark performance.
The pattern here is familiar if you've been watching the open-source model space: DeepSeek releases a model, the community's response is "this is surprisingly good," the benchmark crowd's response is "the numbers are almost as good," and then six months later the model becomes the default for a whole category of production workloads because it was cheap enough and good enough to be worth switching.
DeepSeek V3 did this. DeepSeek R1 did this for reasoning. DeepSeek V4 Pro is doing it for the general-purpose, agent-capable, long-context workload.
The companies that should be paying attention: anyone charging 0-30 per million tokens for a model that DeepSeek is now offering at .40. That's not a sustainable price differential unless the proprietary model has a moat it can actually defend — and most of them don't, because the open-source models are training on similar data, using similar architectures, and iterating faster.
Benchmarks are still synthetic. MMLU and HumanEval tell you something, but they don't tell you everything about how a model will behave on your specific production workload. The reports from developers over the past few days are more interesting than the benchmark tables, but they're also self-reported and subject to selection bias.
The quality you get from the open-weights release depends on quantization. Q4_K_M is the sweet spot for most local deployments, but you should test against your actual workload, not assume the numbers carry over from BF16.
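Testing against your actual workload can start very small. The sketch below compares a quantized model against a reference on your own prompts and reports how often they agree. Both models are passed in as plain callables, so how you actually invoke them (local runtime, API client) is left to you. `agreement_rate` and the default exact-match comparison are assumptions for illustration, not a standard metric.

```python
from typing import Callable, Iterable

Generate = Callable[[str], str]  # prompt in, completion out


def exact_match(a: str, b: str) -> bool:
    """Default comparison: identical after trimming whitespace."""
    return a.strip() == b.strip()


def agreement_rate(reference: Generate, candidate: Generate,
                   prompts: Iterable[str],
                   same: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of prompts where the candidate's answer matches the reference's."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    matches = sum(1 for p in prompt_list if same(reference(p), candidate(p)))
    return matches / len(prompt_list)


if __name__ == "__main__":
    # Stand-in functions so the sketch runs; replace with real model calls.
    def full_precision(prompt: str) -> str:
        return prompt.upper()

    def quantized(prompt: str) -> str:
        return prompt.upper() if len(prompt) < 10 else prompt.lower()

    prompts = ["short", "tiny", "a much longer prompt", "ok"]
    print(agreement_rate(full_precision, quantized, prompts))
```

Exact match only makes sense for deterministic, structured outputs such as tool calls or classifications. For free-form text, swap in a task-specific `same` function, and pin sampling settings so you are measuring quantization rather than randomness.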
And the model is only as good as the serving infrastructure. If you're self-hosting, you need to actually handle the load. DeepSeek's own API is the easiest path; Venice AI and Alibaba Cloud are reasonable fallbacks. Roll your own inference server only if you have the infra team to support it.
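If you do lean on more than one hosted provider, the fallback logic itself is only a few lines. This is a minimal sketch under stated assumptions: each provider is a callable you supply (no real client library or endpoint is shown), and `complete_with_fallback` is a hypothetical helper, not part of any SDK.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]  # (label, function: prompt -> completion)


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (label, completion) from the first that succeeds."""
    errors: list[str] = []
    for label, call in providers:
        try:
            return label, call(prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next provider
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Stand-in callables so the sketch runs; swap in your real API clients.
    def primary(prompt: str) -> str:
        raise TimeoutError("simulated outage")

    def secondary(prompt: str) -> str:
        return f"echo: {prompt}"

    print(complete_with_fallback("hello", [("primary", primary), ("secondary", secondary)]))
```

A production version would add timeouts, retry budgets, and logging of which provider served each request, since prices and behavior can differ between hosts of the same model.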
*DeepSeek V4 Pro released May 12, 2026. Open-source weights available. API from DeepSeek (~.40/.80 per 1M tokens), Venice AI (~.73/.80), Alibaba Cloud (~.40/.80). 1M token context. MMLU 87.8%, MMLU-Pro 65.5%, GSM8K 91.1%, HumanEval Pass@1 69.5%. Best-in-class tool-use and schema adherence per developer reports.*