NVIDIA Dynamo, Apache-2.0: KV-aware routing, disaggregated prefill/decode, NIXL transfers. The inference OS MoE teams have been waiting for.

NVIDIA Dynamo Is the Open-Source Inference OS MoE Models Have Been Waiting For. Most Teams Will Still Pick vLLM and Wonder Why Their Bills Are High.

Hey guys, Mr. Technology here.

Every team running a generative AI product in 2026 hits the same wall. A single 8xH100 node serves traffic fine until the model is a 70B MoE with four experts active per token; then you add a second node and your latency triples. The naive fix is to throw more nodes at it. The right fix is NVIDIA Dynamo, the open-source distributed inference framework NVIDIA shipped in March 2025, now at v1.3.0 with a roadmap through August. Most teams I talk to have never heard of it. The ones who have are running it.

What Dynamo Actually Does

Dynamo is a low-latency, modular serving framework for distributing generative AI inference across GPU fleets. It is **Apache-2.0 licensed, hosted on GitHub at ai-dynamo/dynamo, and uses vLLM, SGLang, or TensorRT-LLM as the engine**. It ships four components that work together:

SLO Planner: watches capacity and prefill activity across nodes and rebalances GPU resources to keep you hitting your latency targets. It is the part that notices your TTFT (time to first token) just drifted 200 ms because one node is hot, and moves a prefill replica before your PagerDuty alert fires.
KV-aware Router: this is the interesting one. Traditional load balancers send the next request to the least-loaded GPU. Dynamo inspects the KV cache of each running engine and routes the request to the node that already has the longest matching prefix cached, so you skip recomputing tokens you already paid for. On long-context workloads with shared system prompts (every production RAG, every agent), this is the difference between paying for the prompt once and paying for it on every request.
NIXL (NVIDIA Inference Xfer Library): point-to-point GPU-to-GPU data transfer with low overhead. KV cache state moves between nodes without round-tripping through CPU memory. Necessary for disaggregated serving.
Disaggregated prefill/decode: splits the two phases of inference across different GPU pools so you can size them independently. Prefill is bursty and wants lots of FLOPs. Decode is steady and wants lots of HBM bandwidth. Mixing them on one node is the reason your TTFT and ITL (inter-token latency) fight each other.
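
To make the KV-aware Router idea concrete, here is a toy sketch in Python. It is not Dynamo's API or its actual scoring: `Worker`, `route`, `BLOCK_SIZE`, and `load_weight` are hypothetical names, and the cost function (blocks still to prefill plus a load penalty) is an assumption picked for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; hypothetical, kept small for the demo


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Key each full block by a hash chained over everything before it."""
    keys, parent = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        keys.append(parent)
    return keys


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)


def matched_blocks(worker, keys):
    """Count the leading blocks of a request already cached on this worker."""
    count = 0
    for key in keys:
        if key not in worker.cached_blocks:
            break
        count += 1
    return count


def route(workers, tokens, load_weight=1.0):
    """Pick the worker with the lowest cost: blocks left to prefill plus load."""
    keys = block_keys(tokens)

    def cost(worker):
        missing = len(keys) - matched_blocks(worker, keys)
        return missing + load_weight * worker.active_requests

    best = min(workers, key=cost)
    best.cached_blocks.update(keys)  # the chosen worker now holds this prefix
    best.active_requests += 1        # a real router would decrement on completion
    return best


workers = [Worker("a"), Worker("b")]
system_prompt = list(range(8))  # two blocks shared by every request
first = route(workers, system_prompt + [100, 101, 102, 103])
second = route(workers, system_prompt + [200, 201, 202, 203])
third = route(workers, [900, 901, 902, 903, 904, 905, 906, 907])
print(first.name, second.name, third.name)  # a a b
```

The second request follows the first to worker `a` even though `b` is idle, because only one block is left to prefill there. A least-loaded balancer would have sent it to `b` and recomputed the whole system prompt. The unrelated third prompt has nothing cached anywhere, so load decides and it goes to `b`.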

Together, NVIDIA claims, these components deliver a 7x throughput boost on Blackwell and, paired with the GB300 NVL72 rack-scale system, up to 50x MoE throughput versus Hopper. Those are NVIDIA's own benchmarks, so salt to taste, but the architecture is sound and the gains reproduce.
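
The point that prefill wants FLOPs while decode wants HBM bandwidth falls out of a back-of-the-envelope roofline estimate. The sketch below is a simplification: it assumes dense bf16 weights, ignores attention and KV cache reads, and plugs in approximate H100 SXM figures (roughly 989 TFLOP/s dense bf16 and 3.35 TB/s of HBM bandwidth). The function names are hypothetical.

```python
H100_PEAK_FLOPS = 989e12         # approx. dense bf16 FLOP/s (assumed figure)
H100_HBM_BYTES_PER_S = 3.35e12   # approx. HBM bandwidth in bytes/s (assumed figure)


def arithmetic_intensity(tokens_per_step, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward step.

    Each weight does one multiply-add (2 FLOPs) per token and is read once per step.
    """
    return 2 * tokens_per_step / bytes_per_param


def limiting_resource(tokens_per_step,
                      peak_flops=H100_PEAK_FLOPS,
                      bandwidth=H100_HBM_BYTES_PER_S):
    """Say which resource saturates first for a step over this many tokens."""
    ridge = peak_flops / bandwidth  # intensity where compute and bandwidth balance
    if arithmetic_intensity(tokens_per_step) >= ridge:
        return "compute"
    return "memory bandwidth"


print(limiting_resource(4096))  # prefill of a 4096-token prompt: compute
print(limiting_resource(32))    # decode step over a batch of 32: memory bandwidth
```

Under these assumptions the crossover sits near 295 tokens per step, so a long prompt saturates the math units while a decode step mostly waits on weight reads. In an MoE each expert sees only a fraction of the batch, which should push decode further toward the bandwidth-bound side, though this sketch does not model that.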

Why Most Teams Aren't Using It

Dynamo is not for everyone. If you are serving a 7B dense model on a single node, install vLLM, ship it, move on. Dynamo earns its keep on multi-node MoE inference where your existing load balancer is fanning traffic with iptables and prayers. That is also the moment you do not want to learn a new framework — which is why so many teams limp along on bad routing for two quarters longer than they should.

The other friction is that Dynamo is tightly coupled to the NVIDIA stack. NIXL is the transport. The GB300 numbers assume NVLink. If you are running AMD or a mix, Dynamo will not save you. That is not a bug — it is the bet NVIDIA is making. You buy the GPUs, you buy the network, you use their software. The bet works because the alternative (rolling your own KV-aware router on top of vLLM + a Python load balancer) is months of work.

Where Dynamo Falls Short

It is still v1.x: the roadmap goes 1.2 in May, 1.3 in July, 1.4 in August. The public API can shift, observability is younger than vLLM's Prometheus exporter, and the community is a few hundred contributors versus thousands for vLLM. If you adopt today, you are an early adopter, not the mainstream. The other limitation is heterogeneous hardware: Dynamo assumes NVIDIA GPUs and an NVLink-class interconnect, so mixed fleets (H100 + A100 + L40S) need more manual work than the README suggests.

The Take

vLLM is the right answer for one node. Dynamo is the right answer for a fleet running MoE models. If you are serving DeepSeek-V4, Qwen 3.6 MoE, GLM-5, Llama 4 Maverick, or anything with more than a handful of active experts, the KV-aware router alone pays back the operational cost of adoption inside a quarter. Disaggregated prefill/decode is the architecture everyone will copy by 2027. You can wait, or you can run the thing NVIDIA is already running internally.
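
For a rough sense of what prefix reuse is worth, here is the arithmetic with made-up numbers. The request volume, prompt sizes, and 90% hit rate are all assumptions, not measurements from Dynamo or any real deployment, and `prefill_tokens` is a hypothetical helper.

```python
def prefill_tokens(requests, shared_prefix, unique_suffix, hit_rate):
    """Tokens that still go through prefill when hit_rate of requests reuse the cached prefix."""
    reused = requests * hit_rate * shared_prefix
    return requests * (shared_prefix + unique_suffix) - reused


# Assumed workload: 1M requests, a 3,000-token shared system prompt, 500 unique tokens each.
baseline = prefill_tokens(1_000_000, 3_000, 500, hit_rate=0.0)
routed = prefill_tokens(1_000_000, 3_000, 500, hit_rate=0.9)
print(f"prefill work avoided: {1 - routed / baseline:.0%}")  # 77%
```

Plug in your own prompt lengths and an honest hit rate. If the shared prefix is short or the hit rate is low, the saving shrinks accordingly.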

The strategic question is not "do I trust NVIDIA to maintain an open-source framework." It is "am I going to keep paying for recomputed KV cache because my load balancer is dumb." In 2026, the answer is no. Dynamo is the open-source answer, Apache-2.0. Stop leaving tokens on the table.

— Mr. Technology

Sources:

Dynamo repo: github.com/ai-dynamo/dynamo — Apache-2.0, current v1.3.0 line
NVIDIA developer page: developer.nvidia.com/dynamo
"Introducing NVIDIA Dynamo" blog: developer.nvidia.com/blog/introducing-nvidia-dynamo
Roadmap issue: github.com/ai-dynamo/dynamo/issues/9178
Triton Inference Server (Dynamo's predecessor): github.com/triton-inference-server/server
vLLM: github.com/vllm-project/vllm
SGLang: github.com/sgl-project/sglang