Almost every production LLM gateway I have audited this year does the same thing. It looks at the current prompt, runs a cheap classifier, picks a model, and forwards. Fine abstraction for a chatbot. Wrong abstraction for an agent.
A coding agent does not send prompts. It sends a trajectory: plan, tool call, tool result, fix, re-run, continue, fix the failing case, run that again. Each turn is meaningful only because of the session that came before it. A single-turn router that sees a short follow-up and re-runs its selector breaks the session in three ways at once: it sends a tool result to a model that never issued the tool call, sends a non-portable continuation id to a different physical backend, and throws away a warm prefix cache because the latest user message is two words long.
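To make the three failure modes concrete, here is a minimal Python sketch of the checks a prompt-level router skips when it re-runs its selector. The `Turn` and `SessionView` shapes and the `session_breakages` helper are hypothetical illustrations I made up for this post. They are not Themis code or its API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    # Hypothetical shape of one incoming agent turn.
    is_tool_result: bool            # turn carries the output of a tool call
    continuation_id: Optional[str]  # provider-managed state id, if any
    cached_prefix_tokens: int       # prefix tokens already warm on the current backend


@dataclass
class SessionView:
    last_model: Optional[str]       # physical model that served the previous turn


def session_breakages(session: SessionView, turn: Turn, candidate: str) -> List[str]:
    """List what a switch to `candidate` would break; empty if no switch happens."""
    if session.last_model is None or candidate == session.last_model:
        return []  # first turn, or the selector picked the same model again
    reasons = []
    if turn.is_tool_result:
        reasons.append("tool result sent to a model that never issued the call")
    if turn.continuation_id is not None:
        reasons.append("continuation id sent to a backend that cannot resolve it")
    if turn.cached_prefix_tokens > 0:
        reasons.append(f"{turn.cached_prefix_tokens} warm prefix tokens discarded")
    return reasons
```

Only the first two reasons are correctness problems. The third is a cost problem, which is why the design below treats them differently: locks for the first two, pricing for the cache.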
vLLM Semantic Router v0.3, codenamed Themis and shipped June 5, 2026, is the first open-source router that takes that problem seriously. Iris (January) and Athena (March) set up composable signals and a richer model-selection policy. Themis adds the thing the stack has been missing: Session-Aware Agentic Routing (SAAR).
Across 21,600 deterministic turns, SAAR cuts model switches by 79.29%, eliminates 3,836 unsafe switches, and reduces estimated physical-model cost by 78.71%. Across 2,896 live AMD ROCm requests, it preserves session continuity with 0 observed violations. The second result is what matters: a router that cuts cost by 79% while breaking sessions is a router you cannot use for agents. A router that cuts cost by 79% *while keeping the session valid* is one you can use for everything.
SAAR wraps the existing decision pipeline in five new pieces: **router-owned session memory** (last physical model, matched decision, switch count, replay metadata); **hard locks** around active tool loops and non-portable provider-managed state; **reset boundaries** on idle timeout and decision drift; **switch economics** that price handoff cost, switch history, and prefix-cache checkout; and **replay traces** that record why the router stayed, switched, or refused to switch.
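Here is a rough sketch of router-owned session memory with its two reset boundaries, assuming a simple in-process record per session. The field names, the 15-minute idle window, and the "three consecutive off-decision turns" drift rule are my assumptions for illustration. Themis's actual schema and thresholds may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

IDLE_TIMEOUT_S = 900.0  # hypothetical idle window (15 minutes)
DRIFT_TURNS = 3         # hypothetical: consecutive off-decision turns before reset


@dataclass
class SessionMemory:
    last_model: Optional[str] = None   # last physical model that served the session
    decision: Optional[str] = None     # routing decision the session is bound to
    switch_count: int = 0
    last_seen: Optional[float] = None
    drift_streak: int = 0
    replay: List[dict] = field(default_factory=list)  # minimal replay metadata

    def observe(self, matched_decision: str, now: float) -> Optional[str]:
        """Record a new turn; return the reset reason if a boundary was crossed."""
        reason = None
        if self.last_seen is not None and now - self.last_seen > IDLE_TIMEOUT_S:
            reason = "idle_timeout"
        elif self.decision is not None and matched_decision != self.decision:
            self.drift_streak += 1
            if self.drift_streak >= DRIFT_TURNS:
                reason = "decision_drift"
        else:
            self.drift_streak = 0  # first turn, or back on the bound decision
        if reason is not None:
            # Reset boundary: drop the model binding so the next pick starts fresh.
            self.last_model = None
            self.decision = matched_decision
            self.switch_count = 0
            self.drift_streak = 0
        elif self.decision is None:
            self.decision = matched_decision
        self.last_seen = now
        self.replay.append({"t": now, "decision": matched_decision, "reset": reason})
        return reason
```

In this sketch a single off-topic turn does not unbind the session; only a sustained change of decision or a long pause does. That keeps a two-word "now run the tests" follow-up on the model that already holds the context.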
The hard locks are the part I would bet on. Cost rules apply only when the switch is safe. Almost every other router gets that ordering backwards.
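The ordering is easiest to see in code. In this hypothetical Python sketch, hard locks return before any price is looked at, and switch economics only run on switches that are already safe. The model names, prices, handoff cost, per-switch penalty, and the cache-loss approximation are invented placeholders, not Themis defaults or its cost model. The returned reason string stands in for a replay trace entry.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TurnState:
    in_tool_loop: bool            # hypothetical: a tool call is still open
    has_nonportable_state: bool   # hypothetical: provider-managed continuation in use
    cached_prefix_tokens: int     # prefix tokens warm on the current backend


# Invented placeholder prices and weights, for illustration only.
PRICE_PER_1K: Dict[str, float] = {"big": 0.010, "small": 0.001}
HANDOFF_COST = 0.002     # fixed cost of moving a session
SWITCH_PENALTY = 0.001   # extra cost per switch already made in this session


def decide(last_model: Optional[str], candidate: str, turn: TurnState,
           switch_count: int, expected_new_tokens: int) -> Tuple[str, str]:
    """Return (model, reason). Locks are checked before any cost rule."""
    if last_model is None or candidate == last_model:
        return candidate, "no_switch_needed"
    # 1. Hard locks: correctness first; prices are never consulted here.
    if turn.in_tool_loop:
        return last_model, "locked:tool_loop"
    if turn.has_nonportable_state:
        return last_model, "locked:nonportable_state"
    # 2. Switch economics: only reached when the switch is already safe.
    savings = (PRICE_PER_1K[last_model] - PRICE_PER_1K[candidate]) * expected_new_tokens / 1000
    # Crude approximation: a cold backend must re-read the cached prefix at full price.
    cache_loss = PRICE_PER_1K[candidate] * turn.cached_prefix_tokens / 1000
    cost = HANDOFF_COST + SWITCH_PENALTY * switch_count + cache_loss
    if savings > cost:
        return candidate, f"switch:savings={savings:.4f}>cost={cost:.4f}"
    return last_model, f"stay:savings={savings:.4f}<=cost={cost:.4f}"
```

With these numbers, a safe switch from `big` to `small` for a 10,000-token turn goes through, but the same switch is refused while a tool loop is open, however large the savings would be. A session that has already switched many times, or is sitting on a large warm prefix, also stays put.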
The thing that lands in everyone's lap this week is the canonical v0.3 configuration contract.
```yaml
version: v0.3
listeners: []
providers: {}
routing: {}
global: {}
```
Four top-level sections. Before Themis, the project had overlapping config shapes across local Docker, dashboard-generated config, Helm values, CRDs, examples, and older docs. Typos drifted silently. The new contract makes `config.yaml` the steady-state file everywhere, warns on unknown fields, and ships a `vllm-sr config migrate` path for old files. It is a breaking change, and the right kind for a pre-1.0 router: fewer dialects, clearer ownership, a more durable public contract.
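The unknown-field warning is the piece that stops typos from drifting silently. Below is a minimal sketch of the idea, assuming the YAML has already been parsed into a dict. `check_top_level` is a hypothetical helper, not the actual `vllm-sr` validator; it only inspects the top level, and the strict `v0.3` version check is my own assumption.

```python
import difflib
import warnings
from typing import Any, Dict, List

# Top-level keys of the canonical v0.3 contract: version plus the four sections.
KNOWN_KEYS = ("version", "listeners", "providers", "routing", "global")


def check_top_level(config: Dict[str, Any]) -> List[str]:
    """Warn on unknown top-level keys instead of silently ignoring them."""
    problems = []
    if config.get("version") != "v0.3":
        problems.append(f"expected version 'v0.3', got {config.get('version')!r}")
    for key in config:
        if key in KNOWN_KEYS:
            continue
        # Suggest the closest known key so a typo is easy to spot.
        hint = difflib.get_close_matches(key, KNOWN_KEYS, n=1)
        msg = f"unknown top-level field {key!r}"
        if hint:
            msg += f" (did you mean {hint[0]!r}?)"
        problems.append(msg)
    for msg in problems:
        warnings.warn(msg)
    return problems
```

Pointing at the likely intended key matters more than the warning itself: a misspelled `routnig:` section is exactly the kind of error that previously went unnoticed.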
The signal catalog is broad enough for real traffic: `authz`, `complexity`, `context`, `conversation`, `domain`, `embedding`, `event`, `fact_check`, `jailbreak`, `kb`, `keyword`, `language`, `modality`, `pii`, `preference`, `reask`, `structure`, `user_feedback`. The DSL added `SIGNAL_GROUP`, `TEST`, `TIER`, conflict detection, and a natural-language-to-DSL pipeline. Themis policies are reviewable routing programs with tests and retained `EMIT` outputs.
LiteLLM is primarily a unified API gateway rather than a semantic router: fallback, retry, load balancing, cost tracking, provider translation. Cheap model selection is a custom classifier you bolt on. OpenRouter is a hosted marketplace that picks a model per request and bills you. Great for prototyping, but no router-owned session memory, no replay, no self-hosting. Martian, Not Diamond, and the hosted reasoning routers from the big labs all do prompt-level routing with a small amount of metadata. None of them treat the agent session as a first-class object. Themis is the first open-source router that does, and the first project to ship benchmarks for it.
If you are running vLLM, KServe, or any OpenAI-compatible serving stack on Kubernetes with more than one model in production, evaluate Themis this week. The Helm chart, CRDs, dashboard, and DSL all converge on the same canonical config. The install is one Helm release and one `vllm-sr config import` for provider inventories. AMD ROCm support is real. Replay traces turn "we routed a request" into "we routed a request, and here is the proof." If you only have one model or your traffic is a single-turn chatbot, SAAR will not pay for itself.
The right way to think about LLM routing in 2026 is not "which model for this prompt?" It is "which model for this session, and is it safe to switch right now?" Every prompt-level router I have seen in production is paying a hidden tax on agent traffic, and most teams do not realize it because the failures look like flaky model behavior, not routing bugs. Themis is the first open-source project that names the problem, designs the right abstraction, and ships the benchmarks. 4,000+ stars, 100 contributors, Apache 2.0, and a maintainer shipping on cadence since Iris in January. I would bet on Themis becoming the default routing layer for serious vLLM deployments by Q4 2026.
*Repo: github.com/vllm-project/semantic-router — Apache 2.0, 4K+ stars, 100 contributors, v0.3 Themis released June 5, 2026. Helm chart, CRDs, dashboard, DSL, and replay dashboard all in the main release. Install: `helm install vllm-sr vllm-semantic-router/vllm-sr`. Migrate older configs with `vllm-sr config migrate --config old-config.yaml`.*