
You have three models in your stack. Claude for production chat. GPT-5 for the eval step. Ollama for local dev. Three SDKs. Three error formats. Three rate-limit headers. Every team I have worked with in 2026 hits the same wall: the model layer is now a multi-vendor tax, paid in onboarding time and the silent cost of an engineer who gives up and hard-codes the SDK that works.
Hi guys, Mr. Technology here.
The fix is LiteLLM — BerriAI's OpenAI-compatible proxy that sits in front of every model you call, speaks the OpenAI wire format, and routes to 100+ providers under one URL. Five minutes of setup gives you a single /v1/chat/completions endpoint that handles Claude, GPT, Gemini, Bedrock, Vertex, Mistral, Ollama, vLLM, and any local server. Your app talks OpenAI. LiteLLM translates.
bash pip install 'litellm[proxy]'
Drop a YAML at the root of your repo:
```yaml model_list:
litellm_params: model: claude-sonnet-4-6 api_key: os.environ/ANTHROPIC_API_KEY
litellm_params: model: gpt-5 api_key: os.environ/OPENAI_API_KEY
litellm_params: model: ollama/qwen2.5-coder:32b api_base: http://localhost:11434 ```
Spin it up:
bash litellm --config litellm_config.yaml --port 4000
Your entire org hits http://localhost:4000/v1/chat/completions with the OpenAI SDK, picks model="claude-sonnet", and gets a response. Same code in prod, staging, eval, and your laptop's dev loop.
One SDK, one error format. Every provider's quirks — Anthropic's prompt caching headers, Gemini's safety blocks, Bedrock's sigv4, Ollama's missing system role — get normalized at the proxy. Your app catches openai.APIError and openai.RateLimitError the same way for every upstream.
Cost tracking is built in. LiteLLM ships a virtual-key system where you issue per-team or per-engineer keys, set monthly USD budgets, and watch spend land in Postgres or SQLite. Every request logs model, prompt_tokens, completion_tokens, cost_usd, user, team. When finance asks what the eval pipeline cost last month, you answer in ten seconds with SELECT team, SUM(cost_usd) FROM litellm_logs GROUP BY team.
Fallbacks and budgets. Add litellm_params.fallbacks: [gpt-5] to the Claude model and a 429 from Anthropic transparently retries on GPT-5. Add rpm: 100 and tpm: 500000 per virtual key and a runaway script cannot burn the budget. Add timeout: 30 once and your fleet stops hanging on a dead Ollama.
Streaming, function calling, JSON mode, vision — LiteLLM translates every OpenAI feature to the upstream's equivalent. The application code does not change when you swap backends.
Three rules make this stick past the demo:
1. One config file, version-controlled. The YAML lives next to your docker-compose.yml. New models are a PR. "Who added the new key?" is git log litellm_config.yaml.
2. Virtual keys for humans, real keys for CI. Engineers get sk-litellm-<name> with a $200/month budget. CI gets a service-account key with a 10x budget and an auditable tag.
3. Run the proxy in Docker, not on your laptop. docker run -p 4000:4000 ghcr.io/berriai/litellm:main is the same command in dev, CI, and prod. Your laptop is a client, not a server.
If you ship one model to one customer and never plan to add another, you do not need a proxy. The moment you have two backends — even Claude prod plus Ollama dev — the proxy pays for itself in onboarding time alone.
LiteLLM is the boring infrastructure move that turns a multi-vendor LLM stack from a coordination problem into a config file. Spend the five minutes. Stop hand-rolling provider adapters.
— Mr. Technology
*LiteLLM 1.50+ (June 2026), Apache 2.0, github.com/BerriAI/litellm. Supports 100+ providers including OpenAI, Anthropic, Google, AWS Bedrock, Vertex, Azure, Mistral, Groq, Together, Fireworks, Ollama, vLLM, and any OpenAI-compatible endpoint. Built-in spend tracking via Postgres/SQLite/Prisma, virtual keys, per-key budgets, fallbacks, retries, streaming, function calling, JSON mode, vision. UI on :4000/ui when LITELLM_UI=1.*