
Hey guys, Mr. Technology here. I have been waiting for this one. If you ship an LLM feature in 2026 and you do not have a fallback strategy, you are a single 429 away from a customer escalation. I am going to be honest with you — the day I added OpenRouter's models cascade, my on-call pager went from "weekly" to "almost never." Thirty lines, one list, zero vendor lock-in.
OpenRouter accepts a `models` array. If model #1 returns a rate limit, downtime, context-length error, or moderation flag, OpenRouter silently tries model #2, then #3, then #N. You are billed for the model that actually answered, not the one you asked for. That is it. No SDK. No retry loop. No extra dependency.
**Why this beats a Python try/except loop:** OpenRouter's cascade runs at the edge, with provider-level health telemetry you do not have. A 429 from Bedrock Anthropic and a 429 from Vertex Anthropic are not the same outage. OpenRouter knows. Your Python loop does not.
```bash
pip install requests
```

One env var:

```bash
export OPENROUTER_API_KEY=sk-or-v1-...   # https://openrouter.ai/keys
```
```python
import os
from typing import List, Optional

import requests

API = "https://openrouter.ai/api/v1/chat/completions"

CASCADE: List[str] = [
    "google/gemini-2.5-flash",       # cheap, fast, big context
    "openai/gpt-4.1-mini",           # good fallback, broad tooling
    "anthropic/claude-sonnet-4-5",   # frontier, expensive but reliable
    "gryphe/mythomax-l2-13b:free",   # last resort, free tier
]


def chat(prompt: str, cascade: Optional[List[str]] = None) -> dict:
    models = cascade or CASCADE
    body = {
        "models": models,  # the magic line
        "messages": [{"role": "user", "content": prompt}],
        # Soft provider preference; other providers remain allowed.
        "provider": {"order": ["Anthropic", "Google"], "allow_fallbacks": True},
    }
    r = requests.post(
        API,
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
        json=body,
        timeout=60,
    )
    r.raise_for_status()
    data = r.json()
    # data["model"] is the model that actually answered, not necessarily models[0].
    return {"text": data["choices"][0]["message"]["content"], "used": data["model"]}


if __name__ == "__main__":
    out = chat("Summarize why cascading fallbacks matter in one sentence.")
    print(f"[{out['used']}] {out['text']}")
```
Run it. The terminal will print something like `[anthropic/claude-sonnet-4-5] ...`, which names the model that actually served the request. That is your billable model and the one you should log.
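
Since the served model is the one you pay for, it can help to log how far down the list a request fell. Here is a minimal sketch, not anything OpenRouter ships: `cascade_position` and `log_line` are hypothetical helper names, and the prefix match is an assumption in case the returned slug carries an extra suffix beyond what you requested.

```python
from typing import List, Optional


def cascade_position(used: str, cascade: List[str]) -> Optional[int]:
    """Return the index of the cascade entry that served the request, or None."""
    if used in cascade:
        return cascade.index(used)
    # Assumption: a returned slug may carry an extra suffix, so fall back to
    # the longest cascade entry that is a prefix of it.
    candidates = [(len(m), i) for i, m in enumerate(cascade) if used.startswith(m)]
    return max(candidates)[1] if candidates else None


def log_line(used: str, cascade: List[str]) -> str:
    """Build a one-line log message describing which cascade entry answered."""
    pos = cascade_position(used, cascade)
    if pos is None:
        return f"served_by={used} cascade_position=unknown"
    return f"served_by={used} cascade_position={pos} fell_back={pos > 0}"
```

With the script above you could call `print(log_line(out["used"], CASCADE))` right after `chat()` returns. A position of 0 means the first choice answered; anything higher means a fallback happened and the bill follows that entry.
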
- **`provider.order` is a soft preference, not a hard pin.** OpenRouter may still route to a different upstream. For a hard pin (compliance, residency), set `"allow_fallbacks": False` and accept the downtime risk.
- Check `data.get("error")` before parsing `choices` if you need to surface an upstream error.
- Every model in the cascade has to accept the same `tools` / `response_format` shape, or the fallback returns a schema error. Test the full chain with your exact tool definitions.
- Putting a `:free` model last is fine. Putting it first burns through your daily quota in an hour.

For agent loops, prepend a fast cheap model so a single quick re-route costs nothing. For batch jobs, a local Ollama model (`ollama/llama3.1:8b`) makes a free fallback on your own hardware. Call it from your own code once the cascade has failed, because OpenRouter cannot route to a machine it cannot reach, so it does not belong in the `models` list. Wrap `chat()` with tenacity for app-level retries on top of the cascade (network blips, edge 502s).
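
The `data.get("error")` check can be sketched as a small parsing helper that runs on the decoded body before anything touches `choices`. This assumes the error arrives as a top-level `error` object with a `message` field; that shape, the `UpstreamError` class, and the name `extract_reply` are assumptions for illustration, so check them against real responses before relying on them.

```python
from typing import Any, Dict


class UpstreamError(RuntimeError):
    """Raised when the response body carries an error instead of a reply."""


def extract_reply(data: Dict[str, Any]) -> Dict[str, str]:
    """Pull the text and the serving model out of a decoded response body."""
    err = data.get("error")
    if err:
        # Assumed shape: {"error": {"message": "...", "code": ...}}
        message = err.get("message", str(err)) if isinstance(err, dict) else str(err)
        raise UpstreamError(message)
    choices = data.get("choices") or []
    if not choices:
        raise UpstreamError("response contained no choices")
    return {
        "text": choices[0]["message"]["content"],
        "used": data.get("model", "unknown"),
    }
```

Inside `chat()`, this would replace the final `return` with `return extract_reply(data)`, so a body without a usable reply raises a named exception instead of a `KeyError`.
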
The pattern: let the router do the work. Stop writing retry decorators. Stop catching RateLimitError. Push the list to OpenRouter and ship.
— Mr. Technology