
Every LLM-backed service eventually hits the same wall: your upstream returns 429s, your costs spike, or one misbehaving client hogs capacity. The fix is not a queueing system or a sidecar proxy. It is a token bucket in front of the endpoint, sized to your provider's actual limits and your tolerance for bursts.
Here is the pattern. Under forty lines of Python, no dependencies beyond the standard library, and you get per-key rate limiting that absorbs bursts, keeps you under the provider's limits so 429s become rare, and never blocks your event loop.
```python
import asyncio
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start with a full bucket
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self, cost: float = 1.0):
        if cost > self.capacity:
            # A cost larger than the bucket could never be satisfied.
            raise ValueError("cost exceeds bucket capacity")
        while True:
            async with self.lock:
                # Credit the tokens earned since the last refill.
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.last_refill = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                deficit = cost - self.tokens
                wait = deficit / self.refill_rate
            # Sleep outside the lock so other coroutines can proceed.
            await asyncio.sleep(wait)


buckets: dict[str, TokenBucket] = {}


def get_bucket(key: str, capacity: float = 10, refill_rate: float = 2.0) -> TokenBucket:
    # capacity and refill_rate apply only when a key's bucket is first created.
    if key not in buckets:
        buckets[key] = TokenBucket(capacity=capacity, refill_rate=refill_rate)
    return buckets[key]
```
Then in your FastAPI handler:
```python
@app.post("/chat")
async def chat(req: ChatRequest, user=Depends(auth)):
    bucket = get_bucket(f"user:{user.id}", capacity=10, refill_rate=2.0)
    await bucket.acquire()  # suspends this request (not the event loop) until a token is available
    return await openai_client.chat.completions.create(
        model=req.model,
        messages=req.messages,
    )
```
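The bucket lowers the odds of a 429 but cannot rule one out, so the upstream call usually wants a retry wrapper as well. The following is a minimal sketch, not part of the snippet above: `RateLimited` and `call_with_backoff` are hypothetical names, and `RateLimited` stands in for whatever exception your client raises on a 429 (for the OpenAI Python SDK that is likely `openai.RateLimitError`, but check your version).

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for your client's 429 exception."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("upstream returned 429")
        self.retry_after = retry_after  # seconds suggested by the server, if any


async def call_with_backoff(
    make_call: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the 429 to the caller
            # Exponential backoff, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # If the server suggested a wait, do not retry sooner than that.
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)
            # Up to 10% jitter so waiting callers do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))
    raise ValueError("max_attempts must be at least 1")
```

In the handler, this would wrap the upstream call in a zero-argument callable, for example a lambda that returns the `create(...)` coroutine, after translating the client's own 429 exception into the one caught here.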
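Burst capacity is real. Many providers tolerate short bursts above the steady-state rate, though how much headroom you get depends on the provider and on how it enforces its per-minute limits, so check the documentation for your tier. The bucket's capacity parameter captures that headroom. Setting capacity=10, refill_rate=2.0 means "up to 10 requests right now, then 2 per second sustained," which works out to 120 requests per minute. Treat those numbers as a starting point to tune against your own tier, not as any provider's published limit.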
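To make the burst arithmetic concrete, here is a small idealized calculation. It assumes a full bucket, every request arriving at the same instant, and a cost of one token each; the function names are illustrative and not part of the snippet above.

```python
def seconds_until_admitted(n_requests: int, capacity: float, refill_rate: float) -> float:
    """Earliest time, in seconds, at which the n-th simultaneous request gets a token."""
    # The first `capacity` requests are served from the full bucket at once;
    # every request beyond that waits for tokens refilled at `refill_rate`.
    backlog = max(0.0, n_requests - capacity)
    return backlog / refill_rate


def sustained_per_minute(refill_rate: float) -> float:
    """Long-run request rate once the initial burst has been spent."""
    return refill_rate * 60


# With capacity=10 and refill_rate=2.0:
burst = seconds_until_admitted(10, capacity=10, refill_rate=2.0)     # whole burst fits
overflow = seconds_until_admitted(30, capacity=10, refill_rate=2.0)  # 20 extra requests at 2/s
```

Under these assumptions the first ten requests go through immediately, the thirtieth waits about ten seconds, and the long-run ceiling is 120 requests per minute.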
Per-key isolation prevents one client from starving others. A noisy neighbor with 100 concurrent requests will drain their bucket and then wait, while other clients' buckets stay full. If you are protecting a shared upstream, key by api_key_id instead of user_id so a single heavy user cannot crush the whole tenant.
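The isolation claim can be checked with a simplified model. The sketch below is a synchronous stand-in for the bucket, with the clock passed in explicitly so the outcome is deterministic; `SimBucket` and `simulate_noisy_neighbor` are hypothetical names used only for this illustration, and a request that finds no token is counted as rejected rather than made to wait.

```python
from dataclasses import dataclass


@dataclass
class SimBucket:
    """Synchronous model of the refill arithmetic; time is supplied by the caller."""
    capacity: float
    refill_rate: float
    tokens: float
    last_refill: float = 0.0

    def try_take(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def simulate_noisy_neighbor() -> dict[str, int]:
    buckets = {
        "user:noisy": SimBucket(capacity=10, refill_rate=2.0, tokens=10),
        "user:quiet": SimBucket(capacity=10, refill_rate=2.0, tokens=10),
    }
    admitted = {"user:noisy": 0, "user:quiet": 0}
    # The noisy client fires 100 requests at t=0; the quiet one sends 3.
    for _ in range(100):
        if buckets["user:noisy"].try_take(now=0.0):
            admitted["user:noisy"] += 1
    for _ in range(3):
        if buckets["user:quiet"].try_take(now=0.0):
            admitted["user:quiet"] += 1
    return admitted
```

In this model the noisy client gets only its own ten tokens, while all three of the quiet client's requests are admitted, because each key draws from a separate bucket.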
It is async-native. Unlike asyncio.Semaphore (no time component) or slowapi (request-rate-only, not token-based), this handles steady-state limits and burst behavior correctly. The lock is held only for the cheap arithmetic; the actual asyncio.sleep happens outside the lock so other coroutines proceed.
Pull your provider's documented limits and back off 20 percent:
- capacity=20, refill_rate=8
- capacity=10, refill_rate=0.8
- capacity=5, refill_rate=0.3

Then load-test your endpoint with Locust for ten minutes and watch the 429 rate. If it exceeds 1 percent, drop refill_rate by 20 percent and rerun.
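The sizing rule and the tuning loop reduce to two lines of arithmetic. This is a sketch under assumptions: it takes the provider's limit as requests per minute, and the function names, the `margin` and `step` defaults of 20 percent, and the 1 percent `threshold` are simply the figures from the text turned into parameters.

```python
def refill_rate_from_rpm(rpm_limit: float, margin: float = 0.20) -> float:
    """Convert a requests-per-minute limit to tokens per second, backed off by `margin`."""
    return rpm_limit / 60 * (1 - margin)


def next_refill_rate(
    refill_rate: float,
    observed_429_rate: float,
    threshold: float = 0.01,
    step: float = 0.20,
) -> float:
    """Lower the refill rate by `step` if the load test saw too many 429s."""
    if observed_429_rate > threshold:
        return refill_rate * (1 - step)
    return refill_rate  # within tolerance: keep the current setting


# Example: a hypothetical 600 requests-per-minute limit.
start = refill_rate_from_rpm(600)
after_bad_run = next_refill_rate(start, observed_429_rate=0.03)
after_good_run = next_refill_rate(after_bad_run, observed_429_rate=0.002)
```

Here a 600 requests-per-minute limit gives a starting rate of about 8 tokens per second, a run with 3 percent 429s lowers it to about 6.4, and a clean run leaves it unchanged.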
If you are making fewer than ten LLM calls per second across all users, you do not need this. The provider's limits will rarely fire. This is for the moment your service goes from personal project to shared service — usually right around when the first user reports a 429 and the second one tweets about it.
Drop the snippet into a module, import it in your route, and ship it before the next traffic spike, not after.
Token bucket rate limiting is one of those things that looks like premature optimization until the day it isn't. The day is always sooner than you think.