On July 2, 2026, Poolside open-weighted Laguna XS 2.1: a 33B total / 3B active MoE, 256 experts with top-8 routing, mixed 3:1 sliding-window + global attention, native FP8 KV cache, 256K context, and a separate DFlash speculator that doubles local tok/s. SWE-bench Multilingual 63.1% (+5.4 vs XS.2), SWE-bench Verified 70.9%, $0.10/$0.20 per MTok — half of Haiku 4.5. The first open-weights coding model I'd ship an agent on without hedging.

Poolside's Laguna XS 2.1 Is the First Open-Weights 33B MoE Built for Agentic Coding, and It Beats Claude Haiku 4.5

Hey guys, Mr. Technology here.

On July 2, 2026, Poolside dropped Laguna XS 2.1 — a 33B-total / 3B-active MoE purpose-built for agentic coding. Weights are public on Hugging Face under OpenMDW-1.1, with FP8/INT4/NVFP4 quantizations, support in vLLM 0.21.0+, SGLang, TensorRT-LLM, HF Transformers, and Ollama, plus a free OpenRouter endpoint. The first open-weights coding model I'd ship a real agent loop on, no hedge.

The Architecture Is the Story

Laguna XS 2.1 is not a generic MoE rewrapped for code. The config is opinionated for one workload: long-horizon agentic coding where the model has to keep a 200K-token repo context warm while firing dozens of tool calls per turn.

256 experts + 1 shared expert, top-8 routing. The shared expert handles the residual signal; the 256 routed absorb language idioms, framework APIs, and error-message → fix mappings. Top-8 active per token means ~31% of routed experts fire per forward pass. With 3B active, per-token compute is roughly equivalent to a 3B dense model — pay for capacity, pay only for the route.

Mixed 3:1 sliding-window + global attention across 40 layers. 30 SWA layers (window=512) interleaved with 10 global layers, each with per-layer rotary scaling. Poolside dialed the ratio to 1:3 — the agent loop needs to reference files at arbitrary positions, not just recent tokens.

Native FP8 KV cache. On a 256K context, a BF16 KV cache consumes tens of GB; the FP8 cache cuts that roughly in half — the difference between one H100 and a multi-GPU box. INT4 checkpoints run on a 24GB consumer GPU.

3B active compute per token, 33B total memory, 256K context — runnable on a workstation, one H100, or scaled across a fleet.

The Benchmarks Are Real

Poolside ran the evals through the Laude Institute's Harbor Framework with their own agent harness, max 500 steps, sandboxed, 8 GB RAM / 2 CPU per task (48 GB / 32 CPU for Terminal-Bench 2.0), temperature=1.0, top_k=20, top_p=1, thinking mode on, 256K context. Mean pass@1 averaged over multiple attempts. They also ran a reward-hack judge post-hoc.

The headline moves versus Laguna XS.2:

SWE-bench Multilingual: 57.7% → 63.1% (+5.4 points — the largest single-benchmark jump)
SWE-bench Verified: 69.9% → 70.9% (+1.0 point — saturated benchmark)
SWE-Bench Pro: above Claude Haiku 4.5 and MAI-Code-1-Flash (137B dense), competitive with gpt-oss-120b
Terminal-Bench 2.0: ahead of Haiku 4.5, roughly tied with Qwen3.6 35B-A3B

The 5.4-point Multilingual jump is the one to internalize. Most open-weights coding models lose 10–15 points from Verified to Multilingual. Laguna XS 2.1 loses 7.8 points (70.9 → 63.1). The smallest gap in the 33B-A3B class. Poolside is training on multilingual code trajectories, not just translating bug reports.

Against Claude Haiku 4.5 (the previous default), roughly tied on SWE-Bench Pro and ahead on Terminal-Bench 2.0. Against GPT-5.4 Nano, ahead on every benchmark. Against gpt-oss-120b (4x the active compute), within striking distance on three of four.

DFlash: The Local-Inference Unlock

Poolside also released DFlash — a 5-layer Llama-style draft model that proposes up to 15 speculative tokens per step. End-to-end speedup: 1.67x–2.64x, mean accepted length 3.55–4.57 tokens per step. On an M-class Mac or a consumer GPU, DFlash is the difference between "tolerable" and "snappy." On an H100, the difference between 60 and 150 tok/s effective decode.

The most complete open-weights agent inference stack I have seen in 2026 — native tool calling (Poolside's XML-style poolside_v1 protocol, parsed automatically), interleaved reasoning (toggled per-request with enable_thinking), and speculative decoding in one package.

The License Is the Second Headline

Laguna XS 2.1 ships under OpenMDW-1.1 — a new permissive license the Linux Foundation published with NVIDIA's backing, explicitly designed for model and artifact distribution. No Apache 2.0 retrofit, no custom "OpenRAIL" clauses, no acceptable-use policy. You can use, modify, quantize, fine-tune, distill, redistribute, and serve it commercially without a separate license negotiation. OpenMDW is going to be the Apache 2.0 of the model era.

The Pricing Is Half What You'd Expect

Poolside serves Laguna XS 2.1 at $0.10 / $0.20 / $0.05 per MTok (input / output / cache-read). Free on OpenRouter. Price matched to XS.2 — no premium for the new release.

Model	Input $/MTok	Output $/MTok
Claude Haiku 4.5	$1.00	$5.00
GPT-5.4 Nano	$0.20	$0.80
gpt-oss-120b	$0.20	$0.80
Qwen3.6 35B-A3B	$0.20	$0.60
Laguna XS 2.1	$0.10	$0.20

5x cheaper than Haiku 4.5 on input, 25x cheaper on output, and beats it on the agentic coding benchmarks that matter. That is not a price war. That is a price detonation.

The Limits Worth Naming

The tool-call format is Poolside-specific (poolside_v1 XML, not OpenAI-style). vLLM parses it, but third-party harnesses need a 50-line adapter.

256K context is server-side only — local INT4 on 24GB GPU tops out at 32K–64K. Full 256K needs an H100.

The eval is Poolside's own harness. Harbor is real and Laude is credible, but my priors are 80/20 the +5.4 Multilingual holds within ±1 point when the community re-runs it.

The Take

Laguna XS 2.1 is the first open-weights model I would ship a cost-sensitive coding agent on in 2026 without a hedge. The 33B-A3B MoE with 256 experts and top-8 routing is the right architecture for long-horizon agentic work. The mixed 3:1 SWA + global attention is the right attention pattern. The native FP8 KV cache is the right memory tradeoff. The DFlash speculator is the right local-inference unlock. The OpenMDW-1.1 license is the right open-weights license. The price is the right price. Poolside shipped the complete package.

The closed labs still own the frontier — Sonnet 5, GPT-5.5, Mythos 5 are all a tier above Laguna XS 2.1 on raw agentic ceiling. But the small-model tier — the Haikus, the GPT-5.4 Nanos, the GPT-OSS-120bs — just lost its monopoly on "the cheap model you can ship an agent on." Laguna XS 2.1 is open-weights, self-hostable, fine-tunable, quantizable, and 5–25x cheaper than the equivalent closed small model. The small-model tier is the new battleground, and Poolside won the first round.

Use it for cost-sensitive coding agents, regulated environments, or self-hosted coding products. Skip it for Mythos- or GPT-5.5-class work. Try it on a single H100 with vLLM 0.21.0; the free OpenRouter endpoint is one curl. The first open-weights coding release in 2026 I would not pre-emptively discount.

— Mr. Technology

*Released: July 2, 2026. Model: poolside/Laguna-XS-2.1. Architecture: 33B total / 3B active MoE, 256 experts + 1 shared, top-8 routing, 40 layers, mixed 3:1 SWA (window=512) + global attention, native FP8 KV cache. Context: 256K served. License: OpenMDW-1.1. Quantizations: BF16, FP8, INT4, NVFP4. Pricing: $0.10 / $0.20 / $0.05 per MTok. Speculative decoding: DFlash (1.67x–2.64x speedup). Benchmarks (Harbor Framework, Poolside harness): SWE-bench Verified 70.9%, SWE-bench Multilingual 63.1% (+5.4), SWE-Bench Pro competitive with Haiku 4.5, Terminal-Bench 2.0 ahead of Haiku 4.5. Runtimes: vLLM 0.21.0+, SGLang, TensorRT-LLM, HF Transformers, Ollama. Sources: Poolside, vLLM recipe, HF, OpenRouter, OpenMDW.*