Tutorial · 2026-07-03

Anthropic's API TTFB Is Mostly Your Fault: Connection Pools + Region Routing in 20 Lines

If your first-token latency to Claude is over 600ms, you probably don't have a network problem; you have a connection-pool problem. Here's the httpx setup, the region-routing trick, and the gotchas that make the difference between 1.8s and 280ms cold TTFB.
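What You Need to Know: Every request that lands on a new connection pays a TCP and TLS handshake first, costing 300–800ms of cold TTFB. The anthropic Python client already pools connections through httpx, but idle connections expire after httpx's default keepalive_expiry of 5 seconds and HTTP/2 is off unless you ask for it, so an agent loop with pauses between steps keeps reconnecting. Switch to a shared httpx.Client with Limits(max_connections=20, max_keepalive_connections=20, keepalive_expiry=60) and http2=True, pin a regional base URL if one is available to you, and watch cold TTFB drop from 1.8s to ~280ms. 20 lines of code, one optional extra (httpx[http2]).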

Hey guys, Mr. Technology here. I spent last weekend staring at a Claude Code trace wondering why a "sub-second model" was taking 1.8 seconds before producing a single token. Spoiler: it wasn't the model.

The Symptom

I was benchmarking an agent loop against claude-sonnet-4-5. The first request in a fresh process took 1,750ms to first byte. Subsequent requests dropped to 380ms. Restart the process: back to 1,750ms. Restart my patience: also yes.

The model itself reports ~200ms time-to-first-token on its dashboards. So where were the other 1.55 seconds going?

The Cause (Two of Them)

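1. No warm connection. The stock anthropic.Anthropic() client uses httpx.Client under the hood, but its defaults are tuned for "occasional user request," not "agent loop firing 40 times a minute." Idle connections are dropped after httpx's default keepalive_expiry of 5 seconds, so any pause longer than that (a slow tool call, a user reading output) means a reconnect, and a client constructed per call never reuses anything. Every request that opens a new connection pays the full setup first: DNS lookup, TCP handshake (one round trip), TLS 1.3 handshake (one more round trip; TLS 1.2 needs two), then the HTTP/2 SETTINGS exchange. Only then does the model see your prompt. On a transatlantic link that's 600–900ms of dead air.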

2. Wrong region. api.anthropic.com is a global hostname, but inference happens in specific regions. If your app is in us-east-1 and your traffic enters at a us-west-2 ingress, you eat a coast-to-coast RTT on every call. If a regional endpoint is available to you, pin the base URL to the region closest to your compute.
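
To put rough numbers on both causes, here is a back-of-envelope sketch. It only counts the round trips a new connection needs before the request can leave (one for TCP, one for a full TLS 1.3 handshake, two for TLS 1.2) and ignores DNS, session resumption, and server time. The RTT values in the loop are illustrative assumptions, not measurements.

```python
def cold_connection_overhead_ms(rtt_ms: float, tls_version: str = "1.3") -> float:
    """Time spent on handshakes before the HTTP request can be sent."""
    tcp_round_trips = 1                                   # SYN / SYN-ACK
    tls_round_trips = {"1.3": 1, "1.2": 2}[tls_version]   # full handshake, no resumption
    return (tcp_round_trips + tls_round_trips) * rtt_ms


# Assumed round-trip times, for illustration only
for label, rtt in [("same region", 5), ("coast to coast", 70), ("transatlantic", 90)]:
    overhead = cold_connection_overhead_ms(rtt)
    print(f"{label}: {overhead:.0f} ms of handshakes before the request leaves")
```

A warm connection skips all of this, which is why reuse and a nearby endpoint compound: one removes the round trips, the other makes each remaining one cheaper.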

The Fix (20 Lines)

```python
import os, httpx
from anthropic import Anthropic
# One shared client, kept alive for the lifetime of the process
_http = httpx.Client(
    limits=httpx.Limits(
        max_connections=20,
        max_keepalive_connections=20,
        keepalive_expiry=60,        # longer than your longest idle gap between calls
    ),
    http2=True,                     # needs the h2 package: pip install "httpx[http2]"
    timeout=httpx.Timeout(connect=5, read=120, write=10, pool=10),
)
# Global endpoint unless ANTHROPIC_BASE_URL is set in the environment
BASE_URL = os.getenv("ANTHROPIC_BASE_URL", "https://api.anthropic.com")
client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    http_client=_http,              # SDK reuses our pool for the process
    base_url=BASE_URL,
)
```

First call still pays the handshake. Every call after that reuses a warm socket and lands in 250–400ms.
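
If you want to see connection reuse with your own eyes before trusting it, here is a stdlib-only sketch. It is not httpx and not the Anthropic SDK: it spins up a throwaway local HTTP/1.1 server and counts how many TCP connections the server accepts when a client reuses one connection versus opening a fresh one per request. The names `count_connections` and `CountingHandler` are made up for this demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

accepted = []  # one entry per TCP connection the server accepts


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 enables keep-alive

    def setup(self):
        # setup() runs once per connection, not once per request
        super().setup()
        accepted.append(self.client_address)

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(reuse: bool, requests: int = 3) -> int:
    """Send `requests` GETs and return how many connections the server saw."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    accepted.clear()
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(requests):
            conn = shared or http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if not reuse:
                conn.close()
        if shared:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return len(accepted)
```

`count_connections(reuse=True)` returns 1 and `count_connections(reuse=False)` returns 3. Against a real TLS endpoint, each of those extra connections is a handshake you paid for.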

The Region Trick

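For production in us-east-1, switching to a regional endpoint typically drops another 80–150ms off every call. Treat the hostnames below as examples and confirm against Anthropic's documentation which regional endpoints, if any, are available to you. Set exactly one:

```bash
# Pick ONE; each export overwrites the previous value
export ANTHROPIC_BASE_URL="https://api.us-east-1.anthropic.com"        # Virginia
# export ANTHROPIC_BASE_URL="https://api.eu-west-1.anthropic.com"        # Frankfurt/Dublin
# export ANTHROPIC_BASE_URL="https://api.apac-northeast-1.anthropic.com" # Tokyo
```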

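Verify with curl -w '%{time_connect}\n' -o /dev/null -s "$ANTHROPIC_BASE_URL/v1/messages" -H "x-api-key: $KEY" -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -d '{"model":"claude-haiku-4-5","max_tokens":1,"messages":[{"role":"user","content":"ping"}]}'. Note the double quotes around the x-api-key header: inside single quotes the shell would send the literal string $KEY. time_connect covers the TCP handshake alone, which is roughly one round trip. If it is over 80ms, you are paying a cross-continent RTT on every single call. Pin the region.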
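
If you'd rather run the same check from inside your Python process (same container, same network path as your agent), here is a rough counterpart to curl's time_connect. It times only the TCP handshake, with no TLS and no request. The helper names are mine, and the 80ms threshold is simply the rule of thumb from above.

```python
import socket
import time


def tcp_connect_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Time a bare TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connected; close immediately without sending anything
    return (time.perf_counter() - start) * 1000.0


def looks_cross_continent(connect_ms: float, threshold_ms: float = 80.0) -> bool:
    """Apply the rule of thumb: a connect time above the threshold suggests a far-away ingress."""
    return connect_ms > threshold_ms


# Example usage (needs network access):
#   ms = tcp_connect_ms("api.anthropic.com")
#   print(f"{ms:.0f} ms", "-> check your region" if looks_cross_continent(ms) else "-> fine")
```

Take a few samples rather than one; a single connect time can be skewed by DNS caching or a noisy network.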

The Numbers

Same prompt, same model, same machine, same network. Only the client config changed.

| Client | Cold TTFB | Warm TTFB | Wall time, 20 concurrent |
|---|---|---|---|
| Default Anthropic() | 1,750ms | 380ms | 9.2s |
| Pooled + region-pinned | 280ms | 220ms | 1.4s |

The concurrent number is what matters for agents. Over HTTP/1.1 each in-flight request needs its own connection, so a burst of 20 from a cold or expired pool can mean up to 20 fresh handshakes; the pooled client keeps its connections warm and multiplexes the burst over HTTP/2. On a 20-step refactor that's the difference between an interactive tool and a slideshow.
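
For anyone who wants to reproduce the "20 concurrent" column, here is a minimal sketch of a wall-time harness. It is an assumption about how such a number can be measured, not the exact benchmark script. `call` is any zero-argument function that performs one request; `fake_call` is a stand-in that just sleeps, so the block runs without network access or an API key.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wall_time_s(call, n_requests: int = 20, max_workers: int = 20) -> float:
    """Run `call` n_requests times across a thread pool and return the elapsed wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces every call to finish and re-raises any exception
        list(pool.map(lambda _: call(), range(n_requests)))
    return time.perf_counter() - start


def fake_call() -> None:
    """Placeholder for one API request: pretend it takes 50 ms."""
    time.sleep(0.05)


print(f"20 workers: {wall_time_s(fake_call, 20, 20):.2f}s")
print(f" 2 workers: {wall_time_s(fake_call, 20, 2):.2f}s")
```

Swap `fake_call` for a function that sends a real request through your shared client, and run it once with the default client and once with the pooled one.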

The Gotchas

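  • **http2=True is opt-in, and you want it.** Without it httpx speaks HTTP/1.1, where each in-flight request needs its own connection, so 20 concurrent calls means 20 sockets. With HTTP/2 they multiplex over one warm connection. httpx does not enable HTTP/2 by default: you need http2=True plus the h2 package (pip install "httpx[http2]"), otherwise httpx raises an ImportError when you construct the client. If you sit behind a proxy, check that it doesn't downgrade you to HTTP/1.1.
  • **keepalive_expiry is not "keep alive forever."** Set it longer than your longest idle gap between calls. 60s is a good default. At 5s (httpx's own default) you re-handshake constantly. The server or a load balancer in between can still close an idle connection sooner, so treat it as an upper bound.
  • **One client per process.** Build the client once at module level, not per request. On a serverless function a cold start has no warm process, so the first call still pays the handshake; invocations that land on an already-warm instance can reuse a module-level client. On a long-lived agent, container, or CLI loop, it's the single biggest latency win you can get for free.
  • **Thread safety.** A single httpx.Client can be shared across threads for sending requests. Limits are set when the client is constructed, so configure the pool once up front instead of trying to change it on a live client.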
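
Two of these gotchas are easy to check before you ship. The sketch below is a simplified model, not httpx internals: `http2_available` checks whether the h2 package that http2=True depends on is importable, and `count_reconnects` assumes a connection is dropped whenever it sits idle longer than keepalive_expiry. The gap values are invented for illustration.

```python
import importlib.util


def http2_available() -> bool:
    """True if the h2 package (installed by the httpx[http2] extra) can be imported."""
    return importlib.util.find_spec("h2") is not None


def count_reconnects(idle_gaps_s, keepalive_expiry_s: float) -> int:
    """Count the idle gaps long enough for the pooled connection to have expired."""
    return sum(1 for gap in idle_gaps_s if gap > keepalive_expiry_s)


# Hypothetical idle gaps (seconds) between calls in one agent session
gaps = [2, 8, 30, 1, 12]

print("h2 installed:", http2_available())
print("reconnects at  5s expiry:", count_reconnects(gaps, 5))    # 3
print("reconnects at 60s expiry:", count_reconnects(gaps, 60))   # 0
```

Log the real gaps between calls in your own loop and feed them in; if the count at your chosen expiry isn't zero, raise the expiry or accept the occasional handshake.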

The Take

Most "Claude is slow" complaints in 2026 are actually "I never configured a connection pool and I haven't picked the right region." Both are free. Both take 20 minutes. Both are the kind of low-effort, high-impact plumbing that separates an agent that feels like a tool from one that feels like a webpage.

Set the pool. Pin the region. Ship the loop.

What do you think? Drop your thoughts in the comments below!
