Every LLM-backed service eventually hits a wall of 429s, cost spikes, or one client hogging capacity. Here is a 40-line async token bucket pattern that handles bursts, isolates per-client traffic, and slots cleanly into a FastAPI handler — with concrete sizing numbers for OpenAI, Anthropic, and OpenRouter.

Rate-Limit Your LLM Endpoint With a Token Bucket in 40 Lines

Every LLM-backed service eventually hits the same wall: your upstream returns 429s, your costs spike, or one misbehaving client hogs capacity. The fix is not a queueing system or a sidecar proxy. It is a token bucket in front of the endpoint, sized to your provider's actual limits and your tolerance for bursts.

Here is the pattern. Forty lines of Python, no dependencies beyond asyncio, and you get per-key rate limiting that handles bursts, recovers gracefully on 429s, and never blocks your event loop.

The Code

python

import asyncio
import time
class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()
    async def acquire(self, cost: float = 1.0):
        while True:
            async with self.lock:
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.last_refill = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                deficit = cost - self.tokens
                wait = deficit / self.refill_rate
            # sleep OUTSIDE the lock so other coroutines can proceed
            await asyncio.sleep(wait)
buckets: dict[str, TokenBucket] = {}
def get_bucket(key: str, capacity: float = 10, refill_rate: float = 2.0) -> TokenBucket:
    if key not in buckets:
        buckets[key] = TokenBucket(capacity=capacity, refill_rate=refill_rate)
    return buckets[key]

Then in your FastAPI handler:

python

@app.post("/chat")
async def chat(req: ChatRequest, user=Depends(auth)):
    bucket = get_bucket(f"user:{user.id}", capacity=10, refill_rate=2.0)
    await bucket.acquire()  # blocks until a token is available
    return await openai_client.chat.completions.create(
        model=req.model, messages=req.messages
    )

Why This Pattern Works

Burst capacity is real. Most providers allow short bursts above steady-state. OpenAI lets you burst above your TPM for a few seconds before throttling. The bucket's capacity parameter captures that headroom. Setting capacity=10, refill_rate=2.0 means "10 requests right now, then 2 per second sustained," which matches most tier-1 OpenAI limits.

Per-key isolation prevents one client from starving others. A noisy neighbor with 100 concurrent requests will drain their bucket and then wait, while other clients' buckets stay full. If you are protecting a shared upstream, key by api_key_id instead of user_id so a single heavy user cannot crush the whole tenant.

It is async-native. Unlike asyncio.Semaphore (no time component) or slowapi (request-rate-only, not token-based), this handles steady-state limits and burst behavior correctly. The lock is held only for the cheap arithmetic; the actual asyncio.sleep happens outside the lock so other coroutines proceed.

Sizing It Right

Pull your provider's documented limits and back off 20 percent:

OpenAI Tier 2: 200K TPM, 500 RPM → capacity=20, refill_rate=8
Anthropic Build 1: 100K TPM, 50 RPM → capacity=10, refill_rate=0.8
OpenRouter Free: 20K TPM, 20 RPM → capacity=5, refill_rate=0.3

Then locust your endpoint for ten minutes and watch the 429 rate. If you see more than 1 percent, drop refill_rate by 20 percent and rerun.

When To Skip This

If you are making fewer than ten LLM calls per second across all users, you do not need this. The provider's limits will rarely fire. This is for the moment your service goes from personal project to shared service — usually right around when the first user reports a 429 and the second one tweets about it.

Drop the snippet into a module, import it in your route, and ship it before the next traffic spike, not after.

Token bucket rate limiting is one of those things that looks like premature optimization until the day it isn't. The day is always sooner than you think.