AI Models · 2026-06-02

Step 3.7 Flash: The First Open-Weights Model That Actually Beats Closed Labs at Agents

StepFun shipped a 198B sparse-MoE vision-language model that takes #1 on ClawEval-1.1 (67.1), costs $0.20/M input, runs on a DGX Spark and a 128GB Mac Studio, and is open-weights on Hugging Face. Yes, really.

Let me give you the tl;dr first because most of you will skim: StepFun shipped a 198B-parameter sparse MoE vision-language model that takes #1 on ClawEval-1.1 with 67.1, costs $0.20 per million input tokens, runs on a DGX Spark, and has weights you can download from Hugging Face right now. That's the post. Read on for why this isn't just another "Chinese model" footnote.

The "Flash" Branding Is Doing a Lot of Lifting Here

I'll be honest: when I saw the name "Step 3.7 Flash," I assumed we were getting another small distilled model. We're not. This is a 198-billion-parameter sparse MoE with 288 experts, 8 of them active per token, pulling roughly 11B active parameters per token. The "Flash" refers to throughput (StepFun quotes up to 400 tokens per second on the right hardware), not parameter count. That's a fair naming choice, but be careful when you compare this to a 7B "Flash" variant from elsewhere. The total parameter budget here is roughly 2.8x that of a Llama 3 70B, even though only about 11B parameters do work on any given token.
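If you want to sanity-check the size comparison yourself, it is just arithmetic on the figures quoted above. A minimal sketch follows; the variable and function names are made up for illustration, and the 70B figure is simply the nominal size of a dense 70B model.

```python
# Back-of-envelope ratios from the figures quoted in this post.
# All names here are illustrative, not part of any StepFun tooling.

TOTAL_PARAMS_B = 198.0     # total parameters (billions), as quoted
ACTIVE_PARAMS_B = 11.0     # approximate active parameters per token (billions)
NUM_EXPERTS = 288
ACTIVE_EXPERTS = 8
DENSE_70B_PARAMS_B = 70.0  # nominal size of a dense 70B model, for comparison


def ratio(numerator: float, denominator: float) -> float:
    """Plain ratio with a guard against a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


expert_share = ratio(ACTIVE_EXPERTS, NUM_EXPERTS)       # experts routed per token
active_share = ratio(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)   # parameters touched per token
total_vs_dense = ratio(TOTAL_PARAMS_B, DENSE_70B_PARAMS_B)
active_vs_dense = ratio(ACTIVE_PARAMS_B, DENSE_70B_PARAMS_B)

print(f"experts used per token:     {expert_share:.1%}")
print(f"parameters used per token:  {active_share:.1%}")
print(f"total size vs dense 70B:    {total_vs_dense:.1f}x")
print(f"active size vs dense 70B:   {active_vs_dense:.2f}x")
```

The total comes out at about 2.8x a dense 70B model, while the parameters actually exercised per token are closer to a sixth of one. The active share (about 5.6%) is higher than the expert share (about 2.8%), presumably because attention and other shared layers run for every token regardless of routing.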

The vision side is a separate 1.8B ViT bolted onto the language backbone. It's native, not a wrapper — image and text are fused at the model level, which matters for the document intelligence stuff I'll get to below.

The Benchmarks That Matter

I don't care about your MMLU score. Nobody running production cares. The interesting numbers:

  • #1 on ClawEval-1.1: 67.1 — the next closest competitor scored 59.8. That's a 7+ point lead on what is currently the least game-able agent benchmark I trust. ClawEval specifically tests multi-turn tool orchestration against adversarial traps and policy drift.
  • #1 on SimpleVQA Search: 79.2 — visual search with retrieval-augmented reasoning, not just "describe the image."
  • #2 on SWE-Bench PRO: 56.3 — multi-file repo debugging from raw issue reports. Second to whoever's first, in front of most of the rest.
  • V* (Python): 95.3, which is frontier parity on the V* visual reasoning suite.
  • Toolathlon: 49.5, HLE w. Tool: 48.1 — long-horizon tool workflows.
  • Terminal-Bench 2.1: 59.5, GDPVal-AA: 45.8 — the two "real work" benchmarks where the model is decent but not a frontier-leader.

The honest read: Step 3.7 Flash is the best open-weights model I've seen for agentic orchestration specifically, and the 7+ point ClawEval lead is real, not noise. It's not the best model for pure coding at the absolute frontier — Claude Opus 4.8 and GPT 5.5 still win those — but for a multi-tool agent that needs to call APIs, run searches, verify, and not hallucinate permissions, this is the strongest thing I can run locally or self-host right now.

The Pricing Is the Real Story

Look at the API pricing, because this is what made me sit up:

  • Input (cache miss): $0.20 per million tokens
  • Input (cache hit): $0.04 per million tokens
  • Output: $1.15 per million tokens

Compare that to Claude Opus 4.8 at roughly $15 / $75 per million. Step 3.7 Flash is about 75x cheaper on input and 65x cheaper on output than the closest comparable closed model. And with an 80% cache hit rate on a long-running agent, your blended input cost is about $0.072/M (0.8 × $0.04 + 0.2 × $0.20), which is a rounding error.
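The cache arithmetic is worth doing explicitly, because the blended input price depends on your hit rate. Here is a small sketch using only the per-million prices listed above; the function names are illustrative, and it assumes every input token is billed at either the hit or the miss rate.

```python
# Blended input price under prompt caching, using the per-million-token
# prices listed above. Function names are illustrative.

INPUT_MISS_USD = 0.20   # per 1M input tokens, cache miss
INPUT_HIT_USD = 0.04    # per 1M input tokens, cache hit
OUTPUT_USD = 1.15       # per 1M output tokens


def blended_input_price(hit_rate: float) -> float:
    """Average input price per 1M tokens for a given cache hit rate (0..1)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * INPUT_HIT_USD + (1.0 - hit_rate) * INPUT_MISS_USD


def job_cost_usd(input_tokens: int, output_tokens: int, hit_rate: float) -> float:
    """Total cost of a job, given token counts and a cache hit rate."""
    input_cost = input_tokens / 1e6 * blended_input_price(hit_rate)
    output_cost = output_tokens / 1e6 * OUTPUT_USD
    return input_cost + output_cost


for rate in (0.0, 0.5, 0.8, 1.0):
    print(f"hit rate {rate:.0%}: ${blended_input_price(rate):.3f} per 1M input tokens")
```

The blended price only reaches the $0.04 floor at a 100% hit rate; at 80% it is about $0.072 per million, which is still tiny next to $15.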

This is the first time in 2026 I've seen a frontier-adjacent model priced in the "you can actually run this in production" range without fine-tuning or aggressive distillation. The cost arithmetic just changed.

Three Reasoning Tiers, Not Magic

StepFun shipped three configurable reasoning levels — low, medium, high — controllable per request. Low is for short-form, latency-sensitive calls. High is for the multi-step search/verify loops. This is exactly the right primitive and I'm surprised more labs don't ship it.

In practice you wire it into your orchestration layer: trivial routing decisions use low, complex multi-tool chains use high, single-tool calls use medium. You can cut effective spend by 40-60% on mixed workloads without losing the model's strongest behaviors. Stop paying for a thinking model to classify a yes/no.
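The routing rule above can be sketched in a few lines. This is a hypothetical helper, not StepFun's API: it only picks the tier string, keyed on how many tool calls a step is expected to need. How the tier is passed on the request is left out, because the parameter name is not something this post covers.

```python
from dataclasses import dataclass

# Hypothetical orchestration-side helper. The tier names come from the post;
# the AgentStep type and the tool-count heuristic are assumptions.


@dataclass(frozen=True)
class AgentStep:
    description: str
    tool_calls: int  # number of tool invocations the step is expected to need


def pick_reasoning_tier(step: AgentStep) -> str:
    """Map a step to 'low', 'medium', or 'high' reasoning."""
    if step.tool_calls < 0:
        raise ValueError("tool_calls cannot be negative")
    if step.tool_calls == 0:
        return "low"      # trivial routing or yes/no classification
    if step.tool_calls == 1:
        return "medium"   # a single tool call
    return "high"         # multi-tool search/verify chains


steps = [
    AgentStep("classify ticket as billing or technical", tool_calls=0),
    AgentStep("look up order status", tool_calls=1),
    AgentStep("search, cross-check, then file a refund", tool_calls=3),
]

for step in steps:
    print(f"{pick_reasoning_tier(step):<6} <- {step.description}")
```

A real router would likely use richer signals than a tool count (expected context length, past failure rates), but the shape stays the same: decide the tier before the call, not after.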

What It Actually Runs On

  • Cloud: StepFun's own platform (Global at platform.stepfun.ai, China at platform.stepfun.com), OpenRouter, NVIDIA NIM. DeepInfra, Fireworks, and Modal are "coming soon."
  • Frameworks: SGLang, TensorRT-LLM, vLLM all have day-1 recipes. Recipes are not "PR open" — they shipped.
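  • Local: DGX Station, AMD Ryzen AI Max+ 395 systems, Mac Studio / MacBook Pro with 128GB+ unified memory. Yes, you can run 198B sparse on a Mac Studio. Quantized to roughly 4 bits per weight, the full set of weights fits in 128GB, and the 11B active parameter budget is what keeps per-token compute low enough to be usable.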
  • Quantization: NVFP4 checkpoint on Hugging Face for reduced memory bandwidth and storage. Production-ready, not "coming soon."

The Mac Studio support is the bit nobody is talking about and everybody should be. 128GB unified memory, 11B active, sparse MoE — this is the first time a frontier-adjacent model is genuinely usable as a local dev loop without renting an H100.
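A rough memory estimate shows why 128GB is the threshold. The sketch below counts weights only and assumes a flat number of bits per weight; it ignores quantization scales, the KV cache, and activations, and the 16GB reserve is an arbitrary placeholder for all of that plus the OS.

```python
# Weights-only memory estimate. All names and the reserve figure are
# illustrative assumptions, not measurements.


def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1e9 params * bits / 8 bytes)."""
    return params_billions * bits_per_weight / 8.0


def fits_in_memory(
    params_billions: float,
    bits_per_weight: float,
    memory_gb: float,
    reserve_gb: float = 16.0,  # placeholder for KV cache, activations, OS
) -> bool:
    """True if the weights plus a fixed reserve fit in the given memory."""
    return weight_footprint_gb(params_billions, bits_per_weight) + reserve_gb <= memory_gb


TOTAL_PARAMS_B = 198.0  # every expert must be resident, not just the active ones

for bits in (16, 8, 4):
    size = weight_footprint_gb(TOTAL_PARAMS_B, bits)
    verdict = "fits" if fits_in_memory(TOTAL_PARAMS_B, bits, memory_gb=128.0) else "does not fit"
    print(f"{bits:>2}-bit weights: {size:6.1f} GB -> {verdict} in 128 GB")
```

Note that memory is set by the total parameter count, since any expert can be selected for any token. The 11B active figure governs how much has to be read and computed per token, which is what makes the speed tolerable on unified-memory machines.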

Where I'd Push Back

A few honest caveats:

1. GDPVal-AA at 45.8 means it's not the best model for long, structured professional deliverables. If you're generating 50-page reports, look elsewhere.
2. The "100% tool call success rate" claim floating around a few blog reviews is from a single DGX Spark test on a small benchmark. Don't generalize that. StepFun's own published numbers on Toolathlon and HLE w. Tool are good but not perfect.
3. Chinese-language coverage is good, English ecosystem is still catching up. Most of the integrations, evals, and tooling will be Sinophone-first for a few months.
4. No multi-image or video generation. Vision is input-only. If you need image generation, this isn't your model.

My Take

Step 3.7 Flash is the first 2026 open-weights release where I'm genuinely asking "why am I paying for the closed model?" for agent workloads. The combination of #1 ClawEval, $0.20 input pricing, 256K context, 400 t/s throughput, and 128GB-Mac-Studio local support is not something you find in one model from one lab very often. StepFun is not a household name in the West, but this release is the moment I'd start treating them like one.

If you're building an agent stack in 2026, the right move is to wire Step 3.7 Flash into the same routing layer as your current flagship. Cheap for the 90% case, escalate to expensive for the 10% that actually needs a frontier coding model. The math finally works.
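To see what a 90/10 split does to the bill, here is a hedged sketch. It assumes traffic is split by token volume, uses cache-miss input pricing for Step 3.7 Flash, and takes the rough $15 / $75 flagship figures quoted earlier in this post; the names are illustrative.

```python
# Blended per-million-token price when a share of traffic goes to the cheap
# model and the rest escalates to a flagship. Prices are the figures quoted
# in this post; the split-by-token-volume assumption is mine.

CHEAP = {"input": 0.20, "output": 1.15}     # Step 3.7 Flash, cache-miss input
FLAGSHIP = {"input": 15.0, "output": 75.0}  # rough flagship figures


def blended_price(cheap_share: float) -> dict:
    """Per-1M-token input/output price for a given cheap-model traffic share."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return {
        kind: cheap_share * CHEAP[kind] + (1.0 - cheap_share) * FLAGSHIP[kind]
        for kind in ("input", "output")
    }


def savings_vs_flagship(cheap_share: float) -> dict:
    """Fractional saving relative to sending everything to the flagship."""
    blended = blended_price(cheap_share)
    return {kind: 1.0 - blended[kind] / FLAGSHIP[kind] for kind in blended}


mix = blended_price(0.9)
saved = savings_vs_flagship(0.9)
print(f"input:  ${mix['input']:.2f}/M  ({saved['input']:.0%} cheaper than flagship-only)")
print(f"output: ${mix['output']:.2f}/M  ({saved['output']:.0%} cheaper than flagship-only)")
```

Even at 90/10, the escalated 10% accounts for most of the remaining spend, so the escalation rule is where tuning effort pays off.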


Source: StepFun model card (huggingface.co/stepfun-ai/Step-3.7-Flash), NVIDIA Developer Blog post on enterprise multimodal deployment (developer.nvidia.com/blog/run-step-3-7-flash-on-nvidia-gpus-with-enterprise-ready-multimodal-ai), StepFun launch announcement (static.stepfun.com/blog/step-3.7-flash), Kingy AI launch tracker for May 31, 2026, and a third-party DGX Spark benchmark review (flowtivity.ai).
