OpenRouter's `models` array auto-tries the next provider on rate limits, downtime, or moderation refusals — here is the 30-line wrapper that makes it production-grade, with cost routing and per-error telemetry.

OpenRouter Cascading Fallbacks in 30 Lines of Python (No More Downtime)

Hey guys, Mr. Technology here. I have been waiting for this one. If you ship an LLM feature in 2026 and you do not have a fallback strategy, you are a single 429 away from a customer escalation. I am going to be honest with you — the day I added OpenRouter's models cascade, my on-call pager went from "weekly" to "almost never." Thirty lines, one list, zero vendor lock-in.

What You Get

OpenRouter accepts a `models` array. If model #1 fails with a rate limit, an outage, a context-length error, or a moderation flag, OpenRouter silently tries model #2, then #3, then #N. You are billed for the model that actually answered, not the one you asked for. That is it. No SDK. No retry loop. No extra dependency.

**Why this beats a Python try/except loop:** OpenRouter's cascade runs at the edge, with provider-level health telemetry you do not have. A 429 from Bedrock Anthropic and a 429 from Vertex Anthropic are not the same outage. OpenRouter knows. Your Python loop does not.

Setup

```bash
pip install requests
```

One env var:

```bash
export OPENROUTER_API_KEY=sk-or-v1-... # https://openrouter.ai/keys
```

The Code

```python
import os
from typing import List, Optional

import requests

API = "https://openrouter.ai/api/v1/chat/completions"

# Order matters: cheapest/fastest first, frontier as the safety net.
CASCADE: List[str] = [
    "google/gemini-2.5-flash",        # cheap, fast, big context
    "openai/gpt-4.1-mini",            # good fallback, broad tooling
    "anthropic/claude-sonnet-4-5",    # frontier, expensive but reliable
    "gryphe/mythomax-l2-13b:free",    # last resort, free tier
]


def chat(prompt: str, cascade: Optional[List[str]] = None) -> dict:
    models = cascade or CASCADE
    body = {
        "models": models,                     # the magic line
        "messages": [{"role": "user", "content": prompt}],
        # Optional: prefer these providers when a model has several upstreams.
        # With allow_fallbacks=True this is a soft preference, not a hard pin.
        "provider": {"order": ["Anthropic", "Google"], "allow_fallbacks": True},
    }
    r = requests.post(
        API,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
                 "Content-Type": "application/json"},
        json=body,
        timeout=60,
    )
    r.raise_for_status()
    data = r.json()
    # A failed request can still arrive as JSON with an "error" key and no "choices".
    if data.get("error"):
        raise RuntimeError(f"OpenRouter error: {data['error']}")
    # OpenRouter returns the model that actually answered in `data["model"]`
    return {"text": data["choices"][0]["message"]["content"], "used": data["model"]}


if __name__ == "__main__":
    out = chat("Summarize why cascading fallbacks matter in one sentence.")
    print(f"[{out['used']}] {out['text']}")
```

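Run it. When the first model in the cascade is healthy, the terminal prints something like [google/gemini-2.5-flash] ...; if the request fell through, you will see a later entry such as [anthropic/claude-sonnet-4-5] instead. Either way, that is the model that actually served the request, which is your billable model and the one you should log.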
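
Since the `used` value is what you bill and log against, it can help to record how far down the list each request fell. Here is a minimal sketch, assuming the model id in the response matches one of the requested ids exactly (if the reported id differs in any way, the helper returns -1). `fallback_position` and `log_served_model` are hypothetical names, not part of any OpenRouter client.

```python
import logging
from typing import List

logger = logging.getLogger("cascade")


def fallback_position(requested: List[str], used: str) -> int:
    """Index of the model that answered within the requested list, or -1 if absent."""
    try:
        return requested.index(used)
    except ValueError:
        return -1


def log_served_model(requested: List[str], used: str) -> int:
    """Log whether the first choice answered or the request fell through."""
    pos = fallback_position(requested, used)
    if pos == 0:
        logger.info("served by first choice %s", used)
    elif pos > 0:
        logger.warning("fell back %d step(s): served by %s", pos, used)
    else:
        # Assumption: an id outside the list is worth flagging, not an error.
        logger.warning("served by %s, which is not in the requested list", used)
    return pos
```

Called as `log_served_model(CASCADE, out["used"])` after `chat()`, a rising count of warnings is an early sign that the first entry in the list is having a bad day.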

Gotchas

**provider.order is a soft preference, not a hard pin.** OpenRouter may still route to a different upstream. For a hard pin (compliance, residency), set `"allow_fallbacks": False` and accept the downtime risk.
**Context-length errors trigger fallbacks too.** A 200k-token prompt sent to a model with a 128k context window silently falls through to the next entry. Check `data.get("error")` before parsing `choices` if you need to surface it.
**Tools and JSON-mode schemas are per-model.** All entries must support the same `tools` / `response_format` shape, or the fallback returns a schema error. Test the full chain with your exact tool definitions.
**Free-tier models have aggressive rate limits.** Putting a `:free` model last is fine. Putting it first burns through your daily quota in an hour.
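
The first and last gotchas can be checked in code before a request goes out. This is a minimal sketch, assuming the same request shape as `chat()` above. `build_body` and `misplaced_free_models` are hypothetical helper names introduced for illustration, not part of any OpenRouter SDK.

```python
from typing import Any, Dict, List, Optional


def build_body(prompt: str, models: List[str],
               hard_pin: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build a chat request body; optionally restrict routing to `hard_pin` providers."""
    body: Dict[str, Any] = {
        "models": list(models),
        "messages": [{"role": "user", "content": prompt}],
    }
    if hard_pin:
        # allow_fallbacks=False makes the provider order a restriction:
        # the request fails rather than leaving the listed providers.
        body["provider"] = {"order": list(hard_pin), "allow_fallbacks": False}
    return body


def misplaced_free_models(models: List[str]) -> List[str]:
    """Return ':free' entries that sit ahead of at least one paid entry."""
    last_paid = max(
        (i for i, m in enumerate(models) if not m.endswith(":free")),
        default=-1,
    )
    return [m for i, m in enumerate(models)
            if m.endswith(":free") and i < last_paid]
```

Running `misplaced_free_models(CASCADE)` at startup returns an empty list for the cascade in this post, and flags the list if someone later moves the free model to the front.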

Variations

For agent loops, prepend a fast, cheap model so that an occasional re-route costs almost nothing. For batch jobs, you can add a free fallback on your own hardware, but it cannot go in the `models` array: OpenRouter only routes to models it serves, so a local Ollama model (for example llama3.1:8b) has to be called from your own code when chat() fails. Wrap chat() with tenacity for app-level retries on top of the cascade (network blips, edge 502s).
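
The app-level retry layer can also be sketched with the standard library alone. This is a stand-in for tenacity, not its API: `with_retries` is a hypothetical helper, and the attempt count, base delay, and jitter range are arbitrary assumptions.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on `retry_on` with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

With the script above, a call might look like `with_retries(lambda: chat(prompt), retry_on=(requests.ConnectionError, requests.Timeout))`. Only transport-level failures are retried here; rate limits and provider outages stay the cascade's job.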

The pattern: let the router do the work. Stop writing retry decorators. Stop catching RateLimitError. Push the list to OpenRouter and ship.

— Mr. Technology
