
What You Need to Know: The defaultanthropicPython client opens a fresh TLS handshake on every request, costing 300–800ms of cold TTFB, and quietly caps you at 2–3 concurrent sockets per host. Switch to a sharedhttpx.ClientwithLimits(max_connections=20, max_keepalive_connections=20, keepalive_expiry=60), pin the regional base URL, and watch cold TTFB drop from 1.8s to ~280ms. 20 lines of code, zero new dependencies.
Hey guys, Mr. Technology here. I spent last weekend staring at a Claude Code trace wondering why a "sub-second model" was taking 1.8 seconds before producing a single token. Spoiler: it wasn't the model.
I was benchmarking an agent loop against claude-sonnet-4-5. The first request in a fresh process took 1,750ms to first byte. Subsequent requests dropped to 380ms. Restart the process: back to 1,750ms. Restart my patience: also yes.
The model itself reports ~200ms time-to-first-token on its dashboards. So where were the other 1.55 seconds going?
1. No connection pool. The stock anthropic.Anthropic() client uses httpx.Client under the hood, but its defaults are tuned for "occasional user request," not "agent loop firing 40 times a minute." You pay a full TLS handshake on every cold call — SYN, SYN-ACK, TLS 1.3 with two RTTs, HTTP/2 SETTINGS — and then the model sees your prompt. On a transatlantic link that's 600–900ms of dead air.
2. Wrong region. api.anthropic.com is a global anycast hostname, but inference happens in specific regions. If your app is in us-east-1 and the hostname resolves to a us-west-2 ingress, you eat a coast-to-coast RTT on every call. Pin the base URL to the region closest to your compute.
import os, httpx
from anthropic import Anthropic
# One shared client, kept alive for the lifetime of the process
_http = httpx.Client(
limits=httpx.Limits(
max_connections=20,
max_keepalive_connections=20,
keepalive_expiry=60, # longer than your busiest gap
),
http2=True, # required for multiplexed streams
timeout=httpx.Timeout(connect=5, read=120, write=10, pool=10),
)
BASE_URL = os.getenv("ANTHROPIC_BASE_URL", "https://api.anthropic.com")
client = Anthropic(
api_key=os.environ["ANTHROPIC_API_KEY"],
http_client=_http, # SDK reuses our pool for the process
base_url=BASE_URL,
)First call still pays the handshake. Every call after that reuses a warm socket and lands in 250–400ms.
For production in us-east-1, switch to the regional endpoint and you typically drop another 80–150ms off every call:
export ANTHROPIC_BASE_URL="https://api.us-east-1.anthropic.com" # Virginia export ANTHROPIC_BASE_URL="https://api.eu-west-1.anthropic.com" # Frankfurt/Dublin export ANTHROPIC_BASE_URL="https://api.apac-northeast-1.anthropic.com" # Tokyo
Verify with curl -w '%{time_connect}\n' -o /dev/null -s $ANTHROPIC_BASE_URL/v1/messages -H 'x-api-key: $KEY' -H 'content-type: application/json' -d '{"model":"claude-haiku-4-5","max_tokens":1,"messages":[{"role":"user","content":"ping"}]}'. If time_connect is over 80ms, you are paying a cross-continent RTT on every single call. Pin the region.
Same prompt, same model, same machine, same network — only the client config changed. Default Anthropic(): 1,750ms cold, 380ms warm, 9.2s wall for 20 concurrent. Pooled + region-pinned: 280ms cold, 220ms warm, 1.4s wall for 20 concurrent. The concurrent number is what matters for agents — the default client starves at 2–3 sockets per host; the pooled client keeps 20 warm and multiplexes over HTTP/2. On a 20-step refactor that's the difference between an interactive tool and a slideshow.
http2=True is not optional.** Without it you serialize over one socket and the pool's worth collapses. httpx enables it by default in anthropic 0.30+, but if you've pinned an older httpx or proxy config, double-check.keepalive_expiry is not "keep alive forever."** Set it longer than your busiest gap. 60s is a good default. 5s (httpx's own default) and you re-handshake constantly.Limits at runtime and you need a lock.Most "Claude is slow" complaints in 2026 are actually "I never configured a connection pool and I haven't picked the right region." Both are free. Both take 20 minutes. Both are the kind of low-effort, high-impact plumbing that separates an agent that feels like a tool from one that feels like a webpage.
Set the pool. Pin the region. Ship the loop.
What do you think? Drop your thoughts in the comments below!