
Let me give you the tl;dr first because most of you will skim: StepFun shipped a 198B-parameter sparse MoE vision-language model that takes #1 on ClawEval-1.1 with 67.1, costs $0.20 per million input tokens, runs on a DGX Spark, and you can download the weights on Hugging Face right now. That's the post. Read on for why this isn't another "another Chinese model" footnote.
I'll be honest: when I saw the name "Step 3.7 Flash," I assumed we were getting another small distilled model. We're not. This is a 198-billion-parameter sparse MoE with 288 experts, 8 active per forward pass, pulling roughly 11B active parameters per token. The "Flash" refers to throughput — StepFun quotes up to 400 tokens per second on the right hardware — not parameter count. That's a fair naming choice, but be careful when you compare this to a 7B "Flash" variant from elsewhere. The total parameter budget here is roughly 9x larger than a Llama 3 70B.
The vision side is a separate 1.8B ViT bolted onto the language backbone. It's native, not a wrapper — image and text are fused at the model level, which matters for the document intelligence stuff I'll get to below.
I don't care about your MMLU score. Nobody running production cares. The interesting numbers:
The honest read: Step 3.7 Flash is the best open-weights model I've seen for agentic orchestration specifically, and the 7+ point ClawEval lead is real, not noise. It's not the best model for pure coding at the absolute frontier — Claude Opus 4.8 and GPT 5.5 still win those — but for a multi-tool agent that needs to call APIs, run searches, verify, and not hallucinate permissions, this is the strongest thing I can run locally or self-host right now.
Look at the API pricing, because this is what made me sit up:
Compare that to Claude Opus 4.8 at roughly $15 / $75 per million. Step 3.7 Flash is 75x cheaper on input and 65x cheaper on output than the closest comparable closed model. And with 80% cache hit rates on a long-running agent, your effective input cost is $0.04/M, which is a rounding error.
This is the first time in 2026 I've seen a frontier-adjacent model priced in the "you can actually run this in production" range without fine-tuning or aggressive distillation. The cost arithmetic just changed.
StepFun shipped three configurable reasoning levels — low, medium, high — controllable per request. Low is for short-form, latency-sensitive calls. High is for the multi-step search/verify loops. This is exactly the right primitive and I'm surprised more labs don't ship it.
In practice you wire it into your orchestration layer: trivial routing decisions use low, complex multi-tool chains use high, single-tool calls use medium. You can cut effective spend by 40-60% on mixed workloads without losing the model's strongest behaviors. Stop paying for a thinking model to classify a yes/no.
The Mac Studio support is the bit nobody is talking about and everybody should be. 128GB unified memory, 11B active, sparse MoE — this is the first time a frontier-adjacent model is genuinely usable as a local dev loop without renting an H100.
A few honest caveats:
1. GDPVal-AA at 45.8 means it's not the best model for long, structured professional deliverables. If you're generating 50-page reports, look elsewhere. 2. The "100% tool call success rate" claim floating around a few blog reviews is from a single DGX Spark test on a small benchmark. Don't generalize that. StepFun's own published numbers on Toolathlon and HLE w. Tool are good but not perfect. 3. Chinese-language coverage is good, English ecosystem is still catching up. Most of the integrations, evals, and tooling will be Sinophone-first for a few months. 4. No multi-image or video generation. Vision is input-only. If you need image generation, this isn't your model.
Step 3.7 Flash is the first 2026 open-weights release where I'm genuinely asking "why am I paying for the closed model?" for agent workloads. The combination of #1 ClawEval, $0.20 input pricing, 256K context, 400 t/s throughput, and 128GB-Mac-Studio local support is not something you find in one model from one lab very often. StepFun is not a household name in the West, but this release is the moment I'd start treating them like one.
If you're building an agent stack in 2026, the right move is to wire Step 3.7 Flash into the same routing layer as your current flagship. Cheap for the 90% case, escalate to expensive for the 10% that actually needs a frontier coding model. The math finally works.
Source: StepFun model card (huggingface.co/stepfun-ai/Step-3.7-Flash), NVIDIA Developer Blog post on enterprise multimodal deployment (developer.nvidia.com/blog/run-step-3-7-flash-on-nvidia-gpus-with-enterprise-ready-multimodal-ai), StepFun launch announcement (static.stepfun.com/blog/step-3.7-flash), Kingy AI launch tracker for May 31, 2026, and a third-party DGX Spark benchmark review (flowtivity.ai).