On June 9, 2026, Cohere released North Mini Code: a 30B mixture-of-experts with 3B active parameters, Apache 2.0, 256K context, and a single-H100 footprint. The asynchronous RLVR pipeline is what actually breaks new ground.

Cohere Just Shipped a 30B Coding Model That Fits on One H100. The Post-Training Is the Real Story.

Hey guys, Mr. Technology here.

Cohere released North Mini Code on June 9, 2026. It is a 30-billion parameter mixture-of-experts model with only 3 billion active parameters per token, shipped under Apache 2.0, with a 256K context window, a 64K max output, and a minimum hardware bar of one H100 at FP8. It is the first model in Cohere's new "North" family and the company's first model aimed squarely at developers. It beats Qwen 3.5 35B-A3B, Gemma 4 26B-A4B, and Devstral Small 2 24B on the Artificial Analysis Coding Index, and runs 2.8x faster than Devstral Small 2 at the same concurrency and hardware.

The headline number everyone will quote is "30B MoE, 3B active, Apache 2.0, one H100." That number is fine. It is also not the story. The story is what Cohere did to the post-training.

The Spec Sheet, Briefly

Architecture: decoder-only Transformer with sparse MoE
Total / active parameters: 30B / 3B
Attention: interleaved 3:1 sliding-window (RoPE) and global (no positional embeddings)
Experts: 128 routed, 8 active per token, SwiGLU FFN, sigmoid-then-top-k router
Context / output: 256K input, 64K output
License: Apache 2.0
Distribution: Hugging Face, Cohere API, Cohere Model Vault, OpenRouter, plus Ollama / LM Studio / llama.cpp quantizations

The attention pattern is a Cohere staple — they have shipped the 3:1 sliding/global mix on Command models for two years. The expert configuration is unremarkable by 2026 standards. The interesting work is upstream of inference.
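To make the two spec-sheet terms concrete, here is a minimal sketch of a 3:1 sliding/global layer schedule and a sigmoid-then-top-k router. It is an illustration, not Cohere's code: the function names, the layer count, the position of the global layer inside each group of four, and the renormalisation of the gate weights are all my assumptions.

```python
import math

def attention_schedule(n_layers: int, ratio: int = 3) -> list[str]:
    """Label each layer: `ratio` sliding-window layers, then one global layer.
    Assumption: the global layer closes each group of (ratio + 1) layers."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "sliding"
            for i in range(n_layers)]

def route(logits: list[float], k: int = 8) -> dict[int, float]:
    """Sigmoid-then-top-k: score every expert independently, keep the k best."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    # Assumption: renormalise the kept scores so the gate weights sum to 1.
    return {e: scores[e] / total for e in top}

print(attention_schedule(8))                   # hypothetical 8-layer stack
gates = route([0.01 * e for e in range(128)])  # hypothetical router logits
print(sorted(gates))                           # indices of the 8 selected experts
```

The point of the sigmoid is that each expert is scored on its own rather than competing through a softmax over all 128; the top-k step then picks the 8 that fire for the token.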

The Post-Training Is Where the Moat Is

North Mini Code is post-trained in two phases: a two-stage cascaded SFT, then RL with verifiable rewards (RLVR). None of that is novel on its own. What is novel is how Cohere sequenced it, what they put in the data mix, and how they ran the RL loop.

Phase 1 — Cascaded SFT, Long-to-Longer

Cohere ran two SFT stages at different context lengths: 64K in stage one, 128K in stage two. The stage-one mix is 70% code, split into 43% agentic tool-use data and 27% single-turn competitive or scientific programming (both figures are shares of the whole mix). The stage-two mix is a 4.5-billion-token slice of only the highest-quality verified samples, with code making up 61% of the trainable tokens.

The "long-to-longer" trick matters. Cohere explicitly calls out that when everything was trained in a single stage, the 20B non-code tokens from the stage-one mix swamped the 1.5B high-quality code tokens from the stage-two mix, which produced poorer final performance and more behavioral conflict. Anecdotally, training on a near-complete length distribution produced shorter final trajectories than training on a 64K-truncated distribution. That is not a benchmark number. It is a recipe.

The verifiable-task corpus: 70,000+ tasks across 5,000 unique repositories, deduplicated against SWE-Bench and SWE-Bench-Pro source repositories to prevent leakage. The SFT model before any RL hits 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2.
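If pass@10 is new to you: it is the probability that at least one of 10 sampled attempts passes the tests. Below is a minimal sketch of the standard unbiased estimator popularised by the HumanEval paper. The sources I read do not say which estimator Cohere uses, so treat this as the common convention rather than their method.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes), given n samples with c passes."""
    if n - c < k:
        # Fewer than k failures exist, so every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 4 attempts sampled, 2 of them passed.
print(pass_at_k(n=4, c=2, k=1))  # 0.5
print(pass_at_k(n=4, c=2, k=2))  # 1 - 1/6
```

The benchmark score is this value averaged over tasks, which is why pass@10 always sits well above pass@1 for the same model.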

Phase 2 — Asynchronous RL with CISPO

This is the part I have been waiting for. Coding-agent rollouts are highly variable — the slowest trajectory is routinely an order of magnitude longer than the median. A synchronous RL loop would idle the trainer waiting on the longest rollouts, so Cohere decouples sampling from learning: a trainer runs alongside a vLLM sidecar that serves rollouts continuously, with policy weights exporting every K=4 learner steps. To prevent blocking on the longest trajectories, Cohere uses a windowed FIFO queue — a small head is consumed in completion order to drain stragglers, the rest stays in input order. Empirically this recovers most of the throughput of a strict completion-order scheme without measurably hurting stability.
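Here is how I read the windowed FIFO idea, as a toy sketch. Cohere has not published this code; the function name, the window size, and the simplification that the trainer always takes the earliest finisher inside the head window are my assumptions.

```python
def consumption_order(finish_times: list[float], window: int) -> list[int]:
    """Order in which the trainer consumes rollouts.

    finish_times[i] is when rollout i (in input order) completes.
    Only the first `window` pending rollouts may be taken out of order;
    everything behind the window keeps its input order.
    """
    pending = list(range(len(finish_times)))
    order = []
    while pending:
        head = pending[:window]
        nxt = min(head, key=lambda i: finish_times[i])  # earliest finisher in the head
        order.append(nxt)
        pending.remove(nxt)
    return order

finish = [9, 1, 2, 8, 3]  # hypothetical completion times; rollout 0 is a straggler
print(consumption_order(finish, window=1))            # [0, 1, 2, 3, 4]  strict input order
print(consumption_order(finish, window=2))            # [1, 2, 3, 4, 0]  straggler no longer blocks
print(consumption_order(finish, window=len(finish)))  # [1, 2, 4, 3, 0]  pure completion order
```

With a window of 1 the trainer sits behind the straggler; with a window of 2 it already flows around it while the batch still arrives in nearly input order. That is the trade the paragraph above describes.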

The loss function is CISPO — a log-likelihood objective with token-level importance sampling correction. CISPO differs from PPO and GRPO in that the importance weight multiplies a log-likelihood rather than a probability ratio. It is RLOO with stronger regularization. Loss is aggregated at the token level, so the gradient signal scales with trajectory length and long agentic traces are not down-weighted.
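A rough sketch of what a CISPO-style loss computes, in plain Python so the arithmetic is visible. The clip bounds are placeholders of mine, the advantages are taken as given, and in a real autodiff framework the clipped weight would be wrapped in a stop-gradient so that only the log-probability term carries gradient. None of this is Cohere's implementation.

```python
import math

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Token-level CISPO-style loss.

    logp_new / logp_old: per-trajectory lists of per-token log-probs under the
    current policy and the (stale) sampling policy.
    advantages: one scalar per trajectory.
    """
    total, n_tokens = 0.0, 0
    for new_traj, old_traj, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_traj, old_traj):
            ratio = math.exp(lp_new - lp_old)                        # importance weight
            weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)  # clipped, treated as a constant
            total += weight * adv * lp_new                           # weight times log-likelihood
            n_tokens += 1
    # Token-level aggregation: divide by total tokens, not by trajectory count.
    return -total / n_tokens

# Hypothetical batch: a 2-token trajectory that passed, a 1-token one that failed.
on_policy = cispo_loss([[-1.0, -2.0], [-0.5]], [[-1.0, -2.0], [-0.5]], [1.0, -1.0])
stale = cispo_loss([[-0.1]], [[-5.0]], [1.0])  # ratio far above the upper clip bound
print(on_policy, stale)
```

Because the sum is divided by the total token count, a 50,000-token trajectory contributes 50,000 terms while a short one contributes a handful, which is the "not down-weighted" property described above.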

A single multi-environment RL run spans terminal-based and software-engineering tasks in the same training batch: 512 rollouts, group size 8, 128K-token global context. Each task gets its own turn budget, calibrated by pass@k filtering before RLVR; Cohere observed that granting a model a turn budget substantially larger than necessary encourages unnecessary verbosity and "hoppiness." The reward is binary, derived from unit-test verifiers, with 0 for invalid tool calls or unparseable outputs. That last clause is what makes hallucinated or malformed tool-call rates drop sharply in the first training steps.
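The reward rule is simple enough to write down. The sketch below pairs it with a leave-one-out baseline over a group of 8, which is my assumption based on the RLOO comparison earlier; the function names and the example group are hypothetical.

```python
def reward(output_parsed: bool, tool_calls_valid: bool, tests_passed: bool) -> float:
    """Binary reward: malformed trajectories score 0 regardless of the tests."""
    if not output_parsed or not tool_calls_valid:
        return 0.0
    return 1.0 if tests_passed else 0.0

def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """Each rollout's reward minus the mean reward of the other rollouts in its group."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Hypothetical group of 8 rollouts for one task: two clean passes, one pass
# spoiled by an invalid tool call, five plain failures.
group = [reward(True, True, True), reward(True, True, True),
         reward(True, False, True)] + [reward(True, True, False)] * 5
print(group)
print(leave_one_out_advantages(group))
```

Note the third rollout: the tests would have passed, but the invalid tool call zeroes the reward, so it ends up with the same negative advantage as an outright failure.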

The training result: +7.9 percentage points of pass@1 on Terminal-Bench v2 and +3.0 points on SWE-Bench over the SFT initialization, on top of shorter trajectories and fewer failing tool calls. The final model exhibits less repetitive tool-call looping and reliably concludes by submitting a solution.

The Numbers

Three seeds per benchmark, temperature 1.0, top_p 0.95, harness versions pinned. Methodology disclosed end to end.

Benchmark	North Mini Code (30B-A3B)	Qwen 3.5 (35B-A3B)	Gemma 4 (26B-A4B)	Devstral Small 2 (24B)
AA Coding Index	33.4	lower	lower	lower
SWE-Bench Verified (pass@10)	80.2%	—	—	—
Terminal-Bench v2 (pass@10)	55.1%	—	—	—
Total params	30B (3B active)	35B (3B active)	26B (4B active)	24B (dense)

On the Coding Index, North Mini Code also beats substantially larger models in the same eval: Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B dense). Smaller active footprint, a quarter of the total parameters, better index score. That is the entire pitch in one row.

Speed, From Cohere's Internal Numbers

Metric	North Mini Code	Devstral Small 2	Delta (positive = better)
Output throughput (high concurrency)	2.8×	1.0×	+180%
Output throughput (low concurrency)	2.5×	1.0×	+150%
Inter-token latency	30% lower	baseline	+30%
Time-to-first-token	slightly slower	baseline	−5%

The TTFT trade-off is real. MoE has routing overhead. The penalty is paid once at the start; the throughput gain compounds over every token after. For a 1,000-token generation Cohere's math works out to ~3.5 seconds vs ~10 seconds. For a 10-file batch, 35 seconds vs 100 seconds.
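The arithmetic behind those two figures is easy to check. This sketch assumes a baseline of 100 output tokens per second, which is simply what "1,000 tokens in about 10 seconds" implies, and it ignores TTFT entirely; neither number is a measured value from Cohere.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Decode time only; time-to-first-token is deliberately left out."""
    return tokens / tokens_per_second

BASELINE_TPS = 100.0  # assumption: implied by ~10 s per 1,000 tokens
SPEEDUP = 2.8         # throughput ratio quoted in the table above

baseline = generation_seconds(1_000, BASELINE_TPS)
north = generation_seconds(1_000, BASELINE_TPS * SPEEDUP)
print(f"1 file:   {north:.1f} s vs {baseline:.1f} s")
print(f"10 files: {10 * north:.1f} s vs {10 * baseline:.1f} s")
```

At these generation lengths a TTFT penalty of a few percent is lost in the noise, which is why the trade is worth making.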

Cross-Harness Robustness Is the Underrated Win

North Mini Code was trained against multiple agent scaffolds rather than optimized for one. The second SFT stage included data from SWE-Agent, mini-SWE-Agent, OpenCode, and Terminus 2. Only 6% of the SFT mix comes from the chosen SWE-Agent harness; the rest comes from other sources. The result: a 10% gain on the OpenCode harness evaluation with no degradation on SWE-Agent.

The headline number from cross-harness training: 61.0% pass@1 on mini-SWE-Agent, where the model only sees a single bash tool and raw stdout as feedback. The improvement emerged for free, suggesting that harnesses with overlapping tool capabilities share enough representational structure for positive transfer. Skills required by different harnesses are complementary, not contradictory.
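For a sense of how little scaffolding "a single bash tool and raw stdout" means, here is a minimal sketch of such a tool. It is not the mini-SWE-Agent implementation; the function name, the timeout, and the choice to fold stderr into the returned text are my assumptions.

```python
import subprocess

def run_bash(command: str, timeout_s: float = 30.0) -> str:
    """The agent's only tool: run one shell command and return its raw output."""
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # assumption: errors are shown inline with stdout
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"[command timed out after {timeout_s} s]"
    return proc.stdout

# One turn of a hypothetical agent loop: the model emits a command,
# and the raw text that comes back becomes its next observation.
observation = run_bash("echo hello from the sandbox")
print(observation)
```

There is no file-edit tool, no structured diff, no search API. Everything the model does has to go through shell commands, and it has to read its own results out of unformatted text.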

The practical implication: you can swap your coding agent's harness (Claude Code, Codex, Aider, OpenCode, your own) and North Mini Code's performance does not collapse. That is the difference between a model that wins a benchmark by overfitting to one scaffold and a model that is genuinely usable in production agent stacks.

What It Means for Builders

Local coding agents become cheap. A 30B MoE at 3B active inference fits on a single H100, runs on 2× H100 with the full 320K-token window (256K input plus 64K output), and ships under Apache 2.0. Self-host behind your firewall, no per-token API bill, no data leaving your network. The 2.8x throughput advantage over Devstral Small 2 means your inference fleet does 2.8x the work at the same hardware cost. For teams running thousands of agent turns per day, that is the difference between a viable business model and a quarterly loss.
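A back-of-the-envelope check on the single-H100 claim. The sketch assumes one byte per parameter at FP8 and the 80 GB H100 variant, and it counts weights only; KV cache and activations come out of the remainder.

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9

H100_GB = 80.0                            # assumption: the 80 GB H100 variant
weights_gb = weight_memory_gb(30e9, 1.0)  # FP8: one byte per parameter
headroom_gb = H100_GB - weights_gb

print(f"weights: {weights_gb:.0f} GB, left for KV cache and activations: {headroom_gb:.0f} GB")
```

Note that all 30B parameters have to be resident even though only 3B are active per token. The MoE saves compute per token, not weight memory, which is why long contexts push you toward the second GPU.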

Sovereign coding is a real category now. Cohere is positioning North Mini Code explicitly as a "sovereign AI" model — on-premises, private cloud, no vendor lock-in. With Command A+ as the enterprise generalist and North Mini Code as the developer specialist, Cohere is building the closest thing the open-weights world has to an OpenAI/Anthropic dual-track strategy. The difference is the license.

The agent harness is now a first-class training surface. Cohere's cross-harness training is the technique to watch. Most open coding models are still optimized for a single evaluation harness and collapse when you change scaffolds. North Mini Code is the first model I have seen where the cross-harness data mix is treated as load-bearing. Expect Qwen, DeepSeek, and Mistral to copy this within two release cycles.

The RLVR loss is moving. CISPO has been on arXiv for a while, but this is the first time I have seen it used as the primary objective for a frontier-scale coding agent. Token-level loss aggregation with importance sampling correction is the right answer for long agentic trajectories. Expect it to show up in the next wave of open-weights coding models.

The Take

North Mini Code is the most important open-weights coding model release of June 2026, and the reason is not the 30B/3B numbers. The reason is that Cohere is publishing the entire training recipe: the long-to-longer cascaded SFT, the 70K-task, 5K-repo verifiable-task corpus, the cross-harness data mix, the async RL loop with windowed FIFO queue, the CISPO objective, the per-task turn budget calibrated by pass@k filtering, and the binary reward with explicit zero on invalid tool calls. The Hugging Face model card is 12 pages long because there is a 12-page story to tell.

The Chinese open-weights labs (DeepSeek, Qwen, Moonshot) will continue to win on raw benchmark numbers. Cohere is not playing that game. Cohere is competing on post-training engineering: the unglamorous, infrastructure-heavy work that pays off late, that nobody tweets about and everybody eventually copies. North Mini Code is the proof point.

If you are building coding agents, download the weights this week, run them through your eval, and pay attention to the cross-harness numbers more than the headline index. The 2.8x throughput advantage at the same hardware cost is what makes the model a fit for production. The Apache 2.0 license is what makes it a fit for procurement. The cross-harness training is what makes it a fit for the messy real world of agent stacks that change every quarter.

I have been waiting for a model that treats the agent harness as part of the model. Cohere just shipped it.

— Mr. Technology

Released: June 9, 2026
Model: Cohere North Mini Code 1.0, 30B MoE (128 experts, 8 active, SwiGLU), 3B active parameters per token, 256K context, 64K max output, Apache 2.0
Architecture: decoder-only Transformer, 3:1 sliding-window (RoPE) / global (no positional embeddings) attention, sigmoid-then-top-k routing, one dense layer before sparse MoE
Post-training: two-stage cascaded SFT (64K then 128K context) followed by asynchronous RL with CISPO loss, windowed FIFO queue, 512-rollout batches, group size 8, binary unit-test reward with 0 for invalid tool calls
Evaluation: SWE-Bench Verified 80.2% pass@10, Terminal-Bench v2 55.1% pass@10, Artificial Analysis Coding Index 33.4
Speed: 2.8x output throughput vs Devstral Small 2 at same hardware, 30% lower inter-token latency, slightly slower TTFT
Hardware minimum: 1× H100 at FP8
Sources: Hugging Face (Cohere North Mini Code model card); Hugging Face blog (Introducing North Mini Code); Cohere Docs (North Mini Code 1.0); MarkTechPost (Meet North Mini Code); ExplainX (Cohere North Mini Code deep dive)