
On June 30, 2026, Huawei dropped openPangu-2.0-Flash on Hugging Face and GitCode. Weights, inference code, and the training and inference operators are all up. This is not another "weights only" open-source drop. The training stack ships with it. That is the headline the launch coverage is sleepwalking past.
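The numbers worth knowing: 92B total parameters with 6B active per token, a 512K-token context window, and roughly 34T pretraining tokens.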
The Flash SKU is the lead; openPangu-2.0-Pro (505B total / 18B active) follows in July, per Huawei's HDC 2026 roadmap.
There is no API price because you are the runtime. Self-hosting a 92B MoE on Ascend is the cost story. Community price comparisons put comparable 70B-class open-weight models at around $0.30–$0.50 per MTok on US cloud GPU providers (Fireworks, Together). Self-host on Ascend 910B/910C and the unit economics change: Huawei claims up to 2× the single-card throughput of mainstream open-source peers.
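To make the "you are the runtime" math concrete, here is a minimal sketch that converts an accelerator's hourly cost and sustained output throughput into dollars per million tokens. The card-hour price and tokens-per-second values are hypothetical placeholders, not published Ascend or Huawei figures; the only point is that at a fixed card cost, doubling per-card throughput halves the cost per MTok.

```python
def cost_per_mtok(card_hour_usd: float, tokens_per_sec: float) -> float:
    """Dollars per million output tokens for one card at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return card_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical placeholders, not published Ascend pricing or throughput figures.
CARD_HOUR_USD = 2.00
BASELINE_TPS = 1000.0

baseline = cost_per_mtok(CARD_HOUR_USD, BASELINE_TPS)
doubled = cost_per_mtok(CARD_HOUR_USD, 2 * BASELINE_TPS)  # the claimed 2x per-card throughput
print(f"baseline: ${baseline:.3f}/MTok, at 2x throughput: ${doubled:.3f}/MTok")
# → baseline: $0.556/MTok, at 2x throughput: $0.278/MTok
```

Real deployments also pay for idle capacity, batching limits, and prefill/decode imbalance, so read the output as a best case at full utilization rather than a quote.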
For comparison against the closest published peers:
| Model | Params (total / active) | Context | SWE-bench Verified | AIME 2026 (w/ Python) |
|---|---|---|---|---|
| openPangu-2.0-Flash (Thinking) | 92B / 6B | 512K | 63.1% | 98.1% |
| openPangu-2.0-Flash (Non-Thinking) | 92B / 6B | 512K | 57.6% | — |
| Qwen 3.7-Plus | open MoE | 1M | ~58% (third-party) | ~88% (third-party) |
| Claude Sonnet 5 (June 30) | closed | 1M | ~63.2% (Anthropic) | — |
A 6B-active MoE landing within 0.1 points of Sonnet 5 on SWE-bench Verified (63.1% vs ~63.2%), and posting 98.1% on AIME 2026 with Python where Anthropic has published no comparable figure, is not a "Chinese copy" headline. It is the first time an open-weight MoE has reached parity with a closed frontier model on the coding-agent evals that matter to Western engineering teams.
The catch is the license. OpenPangu License v2.0 is permissive for research and commercial use but includes a use-based restriction clause that Huawei reserves the right to enforce, similar in spirit to the OpenRAIL family. The weights themselves are free; the legal envelope is not Apache.
1. DSA + SWA 1:2 layer split, not a bolt-on. Most long-context models glue sliding-window attention on top of full attention and call it a day. Pangu interleaves sparse global aggregation (DSA) with local-window attention (SWA) in a 1:2 ratio, one DSA layer for every two SWA layers, so only a third of the stack pays for long-range attention at all, and those layers pay a sparse rather than dense price. The README's claim: compute, memory footprint, and memory-access cost all drop on long-context inference without measurable accuracy loss. If this holds up under third-party evaluation, it is the most practical long-context trick of 2026.
2. 4-stream mHC residual topology. Standard transformers have one residual path. Pangu swaps that for four parallel residual streams, claiming better representation diversity and generalization. mHC is one of the more interesting architectural bets out of any Chinese lab in the last 12 months.
3. Three-head MTP for self-speculative decoding. Multi-Token Prediction drafts three future tokens per step from a single forward pass. Pangu trained the MTP heads jointly with the main next-token objective, so the drafts are accurate enough to serve as a built-in speculative decoder. On Ascend this compounds with the DSA/SWA efficiency win; on H100 it depends on how well the kernels port.
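Item 1's 1:2 split can be sketched as a layer schedule plus a rough count of how many keys each query touches per layer. This is an illustration under stated assumptions, not openPangu's implementation: the layer count (48), SWA window (4,096), DSA top-k (2,048), and the DSA-first ordering are hypothetical, and the count ignores whatever scoring step DSA uses to pick its sparse keys.

```python
def layer_schedule(num_layers: int, swa_per_dsa: int = 2) -> list[str]:
    """Interleave one DSA layer per `swa_per_dsa` SWA layers (1:2 by default)."""
    period = swa_per_dsa + 1
    return ["DSA" if i % period == 0 else "SWA" for i in range(num_layers)]


def keys_per_query(kind: str, seq_len: int, window: int, top_k: int) -> int:
    """Keys that one query at the end of a seq_len-token context attends to in one layer."""
    if kind == "SWA":
        return min(window, seq_len)  # local band only
    if kind == "DSA":
        return min(top_k, seq_len)   # sparse subset of the whole context
    return seq_len                   # dense full attention


def total_keys(schedule: list[str], seq_len: int, window: int, top_k: int) -> int:
    return sum(keys_per_query(kind, seq_len, window, top_k) for kind in schedule)


# Hypothetical configuration; none of these values come from the model card.
SEQ_LEN, WINDOW, TOP_K, N_LAYERS = 512 * 1024, 4096, 2048, 48

hybrid = total_keys(layer_schedule(N_LAYERS), SEQ_LEN, WINDOW, TOP_K)
dense = total_keys(["FULL"] * N_LAYERS, SEQ_LEN, WINDOW, TOP_K)
print(layer_schedule(6))  # → ['DSA', 'SWA', 'SWA', 'DSA', 'SWA', 'SWA']
print(f"hybrid / dense keys per query: {hybrid / dense:.4f}")  # → 0.0065
```

Even with these placeholders, per-query attention work at 512K falls to well under 1% of a dense stack; how much of that saving survives the selection step and real kernel overheads is exactly what third-party evaluation has to show.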
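For item 2, a minimal sketch of a four-stream residual update. It assumes mHC refers to manifold-constrained hyper-connections: each block reads its input as a weighted sum of the streams, writes its output back with per-stream weights, and mixes the streams through a matrix pushed toward doubly stochastic by Sinkhorn-Knopp normalization. The function names, the fixed weights, and the toy `tanh` sublayer are hypothetical; in a real model these weights are learned.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize_rows(m):
    return [[v / total for v in row] for row, total in zip(m, map(sum, m))]


def sinkhorn(logits, iters=50):
    """Push exp(logits) toward a doubly stochastic matrix by alternating row/column normalization."""
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iters):
        m = normalize_rows(m)                         # rows sum to 1
        m = transpose(normalize_rows(transpose(m)))   # columns sum to 1
    return m


def mhc_step(streams, mix_logits, pre_w, post_w, layer):
    """One residual update over n parallel streams, each a list of d floats.

    pre_w  (n weights): how the sublayer input is read from the streams
    post_w (n weights): how the sublayer output is written back to each stream
    layer:              the attention or MLP sublayer, mapping d floats to d floats
    """
    n, d = len(streams), len(streams[0])
    mix = sinkhorn(mix_logits)
    x = [sum(pre_w[i] * streams[i][k] for i in range(n)) for k in range(d)]
    y = layer(x)
    return [
        [sum(mix[i][j] * streams[j][k] for j in range(n)) + post_w[i] * y[k] for k in range(d)]
        for i in range(n)
    ]


# Toy example: 4 streams of width 3 with hypothetical weights.
streams = [[0.5, -1.0, 2.0], [0.1, 0.3, -0.2], [1.0, 0.0, 0.4], [-0.6, 0.8, 0.0]]
mix_logits = [[0.3 * (i - j) + (1.0 if i == j else 0.0) for j in range(4)] for i in range(4)]
out = mhc_step(streams, mix_logits, pre_w=[0.25] * 4, post_w=[0.5] * 4,
               layer=lambda v: [math.tanh(t) for t in v])
```

Because every column of the mixing matrix sums to 1, mixing alone preserves the sum across streams, so stacking many layers cannot amplify the residual signal through the mixing path. As we understand it, that stability argument is the reason to constrain the matrix rather than leave it free.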
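Item 3's self-speculative loop, reduced to its control flow: draft three tokens, check them against the main model, keep the agreeing prefix, and take the main model's own token at the first mismatch (or one bonus token if all three pass). This sketch uses greedy verification; the toy `draft` and `main_next` functions are hypothetical stand-ins for the MTP heads and the main decoder, and a real implementation scores all drafted positions in one batched forward pass instead of calling the verifier per token. `expected_tokens` is the standard speculative-decoding estimate, assuming each draft is accepted independently with probability `alpha`.

```python
from typing import Callable

Token = int


def speculative_step(prefix: list[Token], draft: Callable[[list[Token], int], list[Token]],
                     main_next: Callable[[list[Token]], Token], k: int = 3) -> list[Token]:
    """Return the tokens committed in one draft-then-verify step (greedy acceptance)."""
    accepted: list[Token] = []
    for tok in draft(prefix, k):
        target = main_next(prefix + accepted)  # main model's greedy choice at this position
        accepted.append(target)                # always commit the main model's token
        if tok != target:                      # first disagreement ends the step
            return accepted
    accepted.append(main_next(prefix + accepted))  # all k drafts accepted: one bonus token
    return accepted


def expected_tokens(alpha: float, k: int = 3) -> float:
    """Expected tokens committed per step if each draft is accepted independently with prob alpha."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Hypothetical toy models: the "main model" counts upward mod 10.
def main_next(seq: list[Token]) -> Token:
    return (seq[-1] + 1) % 10


def draft(seq: list[Token], k: int) -> list[Token]:
    """Toy "MTP heads": right for two tokens, then skip a number."""
    out, last = [], seq[-1]
    for i in range(k):
        last = (last + (1 if i < 2 else 2)) % 10
        out.append(last)
    return out


print(speculative_step([0], draft, main_next))  # → [1, 2, 3]
print(round(expected_tokens(0.8), 3))           # → 2.952
```

At an 80% per-draft acceptance rate, three drafts commit about 2.95 tokens per main-model step, which is where the built-in speedup comes from.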
The optimizer choice, Muon instead of AdamW, is the bonus. Muon orthogonalizes each weight matrix's momentum update before applying it, and it has been quietly producing better convergence in open training runs throughout 2026.
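A minimal sketch of what "Muon instead of AdamW" means for a single 2-D weight matrix: accumulate momentum, replace the momentum matrix with an approximately orthogonal one via a Newton-Schulz iteration, and step along that. The quintic coefficients follow Keller Jordan's public Muon reference implementation; everything else is simplified and hypothetical (plain rather than Nesterov momentum, no shape-dependent learning-rate scaling, pure-Python matrices), and nothing here reflects openPangu's actual training configuration.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g: push its singular values toward 1, keep its singular vectors."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # Frobenius-normalize so singular values <= 1
    tall = len(x) > len(x[0])
    if tall:
        x = transpose(x)  # iterate on the wide orientation so X X^T is the smaller product
    for _ in range(steps):
        s = matmul(x, transpose(x))  # A = X X^T
        s2 = matmul(s, s)
        poly = [[b * s[i][j] + c * s2[i][j] for j in range(len(s))] for i in range(len(s))]  # B = bA + cA^2
        bx = matmul(poly, x)
        x = [[a * x[i][j] + bx[i][j] for j in range(len(x[0]))] for i in range(len(x))]  # X = aX + BX
    return transpose(x) if tall else x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update; returns (new_weight, new_momentum)."""
    momentum = [[beta * m + g for m, g in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    update = newton_schulz(momentum)
    weight = [[w - lr * u for w, u in zip(wr, ur)] for wr, ur in zip(weight, update)]
    return weight, momentum
```

The iteration deliberately stops short of exact orthogonality: the coefficients are tuned to push every singular value into a rough band around 1 within a few steps, which is cheap enough to run on every optimizer step.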
Pulling from the model card:
| Benchmark | Metric | Flash-Thinking | Flash-Non-Thinking |
|---|---|---|---|
| AIME 2026 | Avg@16 | 93.3 | 86.5 |
| AIME 2026 w/ Python | Avg@16 | 98.1 | — |
| HMMT Feb 2025 | Avg@16 | 91.5 | 67.1 |
| IMO-AnswerBench | Acc | 76.5 | 62.3 |
| GPQA-Diamond | Avg@4 | 83.7 | 79.8 |
| LiveCodeBench V6 | Avg@3 | 85.1 | 50.9 |
| SWE-bench Verified | Avg@3 | 63.1 | 57.6 |
| TAU2-Bench (agent) | Avg@3 | 88.0 | 74.0 |
| MCP-Atlas (agent) | Acc | 58.9 | 47.9 |
| BrowseComp (agent) | Acc | 57.0 | — |
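The Metric column mixes Avg@k and Acc. Model cards generally use Avg@k for the mean accuracy over k independently sampled runs, which smooths out sampling noise on small sets such as AIME's 30 problems per year; we assume openPangu's card follows that convention. A minimal sketch with a hypothetical two-problem example:

```python
from statistics import mean


def avg_at_k(results: list[list[bool]]) -> float:
    """results[p][r] is True if sampled run r solved problem p; returns accuracy in percent.

    Averaging each problem's pass rate over its k runs, then over problems, equals the
    mean of the k per-run accuracies when every problem has the same k.
    """
    return 100 * mean(mean(1.0 if ok else 0.0 for ok in runs) for runs in results)


# Hypothetical two-problem, four-run example (k = 4).
print(avg_at_k([[True, False, True, True], [False, False, True, False]]))  # → 50.0
```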
The 34-point gap on LiveCodeBench V6 between Thinking and Non-Thinking (85.1 vs 50.9) is unusual. Most hybrid-reasoning MoEs separate their two modes by 15–25 points. Pangu either baked reasoning into the slow path unusually well, or the Non-Thinking path is being evaluated under unfavorable settings. Either way, Thinking mode is the one to benchmark.
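A quick way to check how unusual that gap is inside Pangu's own card: subtract Non-Thinking from Thinking for every benchmark in the table that reports both. The scores are copied from the model-card table; the script adds no data.

```python
# (Thinking, Non-Thinking) scores from the model-card table.
scores = {
    "AIME 2026": (93.3, 86.5),
    "HMMT Feb 2025": (91.5, 67.1),
    "IMO-AnswerBench": (76.5, 62.3),
    "GPQA-Diamond": (83.7, 79.8),
    "LiveCodeBench V6": (85.1, 50.9),
    "SWE-bench Verified": (63.1, 57.6),
    "TAU2-Bench": (88.0, 74.0),
    "MCP-Atlas": (58.9, 47.9),
}

gaps = {name: round(t - n, 1) for name, (t, n) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {gap:>5.1f}")
```

LiveCodeBench V6 stands alone at 34.2 points; the next-largest gap, HMMT Feb 2025 at 24.4, sits inside the 15–25 band, and the rest are smaller still.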
Read 1: The export-control clock just sped up for everyone. Sonnet 5 launched on June 30 too, and Anthropic explicitly tuned it below the cyber-capability threshold that took Fable 5 down on June 12. The same week, Huawei shipped an open-weight 92B MoE with 98.1% AIME w/ Python and 88.0% TAU2-Bench, and gave away the training stack that produced it. The U.S. dual-use story is no longer "China is two years behind." It is "China is shipping open-weight frontier models on hardware that does not depend on TSMC." Expect the next round of chip restrictions to widen and to name Ascend explicitly.
Read 2: OpenPangu License v2.0 is the real precedent, not the weights. Apache 2.0 is the comfort blanket of the open-LLM era. Pangu just demonstrated a Chinese hyperscaler can ship frontier-tier weights under a permissive-but-not-Apache license and still land on every Hugging Face trending list within 48 hours. The license pattern is now more important than the parameter count. Watch Alibaba, ByteDance, and Baidu adopt the same template by Q4.
Read 3: 6B active is the new "small." DeepSeek shipped the template; Mistral iterated; Qwen and Pangu now confirm it. A 6B-active MoE on 92B of total capacity matches models that spend 4–8× more compute per token. Anyone building a serving stack that cannot do MoE routing well is paying a tax they no longer need to pay.
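The "4–8× more compute per token" comparison can be made concrete with the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. It ignores attention and routing overhead, so treat it as an order-of-magnitude sketch; the 6B and 92B figures come from the release, and the rest is arithmetic.

```python
def forward_gflops_per_token(active_params_b: float) -> float:
    """Rule of thumb: ~2 FLOPs per active parameter per token (ignores attention and routing)."""
    return 2 * active_params_b  # billions of parameters -> GFLOPs


ACTIVE_B, TOTAL_B = 6.0, 92.0  # openPangu-2.0-Flash, from the release
pangu = forward_gflops_per_token(ACTIVE_B)
print(f"openPangu-2.0-Flash: ~{pangu:.0f} GFLOPs/token, {ACTIVE_B / TOTAL_B:.1%} of weights active")
# → openPangu-2.0-Flash: ~12 GFLOPs/token, 6.5% of weights active
for mult in (4, 8):
    print(f"{mult}x the compute ~ {mult * ACTIVE_B:.0f}B active params "
          f"({mult * pangu:.0f} GFLOPs/token)")
```

By this yardstick the comparison class is models activating roughly 24B to 48B parameters per token.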
OpenPangu-2.0-Flash is not the biggest LLM release of the week — Sonnet 5 is. But Flash is the most strategically significant release of the month: a frontier-tier open-weight MoE trained entirely on a non-NVIDIA stack, with the training operators to match, under a license the open-source community will spend the next quarter arguing about. The weights are free. The training stack changes the calculation. The hardware target tells you where this is all heading.
— Mr. Technology
*Released: June 30, 2026. Model: openPangu-2.0-Flash (92B total / 6B active MoE). Pricing: open weights, self-hosted. License: OpenPangu Model License Agreement v2.0 (permissive, not Apache 2.0). Context: 512K. Hardware target: native Ascend NPU; Huawei claims up to 2× per-card throughput vs. comparable open-weight peers. Pretraining: ~34T tokens. Architecture: MLA attention + DSA/SWA 1:2 layered + 4-stream mHC residuals + 3-head MTP + Muon optimizer. Benchmarks (Thinking mode): SWE-bench Verified 63.1%, LiveCodeBench V6 85.1%, AIME 2026 93.3 (98.1 w/ Python), GPQA-Diamond 83.7, TAU2-Bench 88.0. Sources: Hugging Face model card, AIBase news, Pandaily, Tech in Asia, Dealroom, AI Policy Daily, Latent Space AINews.*