Tutorial · 2026-06-22

Rate-Limit Your LLM Endpoint With a Token Bucket in 40 Lines

Every LLM-backed service eventually hits a wall of 429s, cost spikes, or one client hogging capacity. Here is a 40-line async token bucket pattern that handles bursts, isolates per-client traffic, and slots cleanly into a FastAPI handler — with concrete sizing numbers for OpenAI, Anthropic, and OpenRouter.

Every LLM-backed service eventually hits the same wall: your upstream returns 429s, your costs spike, or one misbehaving client hogs capacity. The fix is not a queueing system or a sidecar proxy. It is a token bucket in front of the endpoint, sized to your provider's actual limits and your tolerance for bursts.

Here is the pattern: about forty lines of Python with no dependencies beyond the standard library (asyncio and time). You get per-key rate limiting that absorbs bursts, paces requests so upstream 429s stay rare, and never blocks your event loop.

The Code

```python
import asyncio
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full so the first burst is allowed
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self, cost: float = 1.0):
        if cost > self.capacity:
            # The bucket can never hold this many tokens, so waiting would loop forever.
            raise ValueError("cost exceeds bucket capacity")
        while True:
            async with self.lock:
                # Credit the tokens earned since the last refill, capped at capacity.
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
                self.last_refill = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                deficit = cost - self.tokens
                wait = deficit / self.refill_rate
            # sleep OUTSIDE the lock so other coroutines can proceed
            await asyncio.sleep(wait)


buckets: dict[str, TokenBucket] = {}


def get_bucket(key: str, capacity: float = 10, refill_rate: float = 2.0) -> TokenBucket:
    # capacity and refill_rate apply only when the bucket for this key is first created
    if key not in buckets:
        buckets[key] = TokenBucket(capacity=capacity, refill_rate=refill_rate)
    return buckets[key]
```

Then in your FastAPI handler:

```python
# app, ChatRequest, auth, and openai_client are defined elsewhere in your service
@app.post("/chat")
async def chat(req: ChatRequest, user=Depends(auth)):
    bucket = get_bucket(f"user:{user.id}", capacity=10, refill_rate=2.0)
    await bucket.acquire()  # waits until a token is available
    return await openai_client.chat.completions.create(
        model=req.model, messages=req.messages
    )
```

Why This Pattern Works

Burst capacity is real. Most providers tolerate short bursts above the steady-state rate before they start throttling, and the bucket's capacity parameter captures that headroom. Setting capacity=10, refill_rate=2.0 means "10 requests right now, then 2 per second (120 per minute) sustained," a conservative starting point for most entry-tier limits. Note that the bucket as written counts requests, not LLM tokens, so compare those numbers against your tier's RPM limit rather than its TPM.
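To see what those two parameters do, the burst-then-sustain schedule can be worked out without any real waiting. The sketch below is a simplified simulation rather than the async class itself: `admission_times` is a hypothetical helper that assumes every request is already queued at t=0, each costs one token, and the bucket starts full.

```python
def admission_times(n: int, capacity: float, refill_rate: float) -> list[float]:
    """Return the time (seconds from start) at which each of n queued requests is admitted."""
    tokens = capacity  # bucket starts full
    now = 0.0
    times = []
    for _ in range(n):
        if tokens < 1.0:
            # Not enough tokens: advance the clock until one full token has refilled.
            wait = (1.0 - tokens) / refill_rate
            now += wait
            tokens = min(capacity, tokens + wait * refill_rate)
        tokens -= 1.0
        times.append(now)
    return times


schedule = admission_times(14, capacity=10, refill_rate=2.0)
print(schedule[:10])  # ten zeros: the burst is admitted immediately
print(schedule[10:])  # [0.5, 1.0, 1.5, 2.0]: then one request every half second
```

The first ten requests pass at once and the rest are spaced 1 / refill_rate seconds apart, which is the behavior the two parameters are meant to express.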

Per-key isolation prevents one client from starving others. A noisy neighbor with 100 concurrent requests will drain their bucket and then wait, while other clients' buckets stay full. If you are protecting a shared upstream, key by api_key_id instead of user_id so a single heavy user cannot crush the whole tenant.
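The isolation claim is easy to check with a deterministic toy. In this sketch, `SimBucket` and `get_sim_bucket` are hypothetical stand-ins for the article's `TokenBucket` and `get_bucket`: the refill arithmetic is the same, but the bucket is synchronous and the caller passes the clock in, so no real time elapses.

```python
class SimBucket:
    """Synchronous stand-in for a token bucket; the caller supplies the clock."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full
        self.last_refill = 0.0

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens earned since the last call, then take `cost` if available.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


sim_buckets: dict[str, SimBucket] = {}


def get_sim_bucket(key: str) -> SimBucket:
    if key not in sim_buckets:
        sim_buckets[key] = SimBucket(capacity=10, refill_rate=2.0)
    return sim_buckets[key]


# A noisy client fires 100 requests at t=0; only the first 10 find a token.
noisy_admitted = sum(get_sim_bucket("user:noisy").try_acquire(now=0.0) for _ in range(100))

# A quiet client arriving at the same instant has its own, still-full bucket.
quiet_admitted = get_sim_bucket("user:quiet").try_acquire(now=0.0)

print(noisy_admitted, quiet_admitted)  # 10 True
```

Because the dictionary key decides which bucket is drained, the choice of key (user, API key, tenant) is the choice of who shares a limit with whom.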

It is async-native. Unlike asyncio.Semaphore (no time component) or slowapi (request-rate-only, not token-based), this handles steady-state limits and burst behavior correctly. The lock is held only for the cheap arithmetic; the actual asyncio.sleep happens outside the lock so other coroutines proceed.
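A small experiment can make the "sleep outside the lock" point concrete. The sketch below repeats the bucket logic so it runs on its own, then starts a hypothetical `heartbeat` task and counts how often it ticks while an `acquire` call is waiting for a refill. The 20 tokens-per-second rate and 10 ms tick interval are arbitrary values chosen to keep the demo short.

```python
import asyncio
import time


class TokenBucket:
    """Same logic as the article's class, repeated so this demo runs on its own."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self, cost: float = 1.0) -> None:
        while True:
            async with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last_refill) * self.refill_rate)
                self.last_refill = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                wait = (cost - self.tokens) / self.refill_rate
            await asyncio.sleep(wait)  # the lock has already been released here


async def demo() -> tuple[int, float]:
    bucket = TokenBucket(capacity=1, refill_rate=20.0)  # one token every 50 ms
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    start = time.monotonic()
    await bucket.acquire()  # takes the only token immediately
    await bucket.acquire()  # must wait roughly 50 ms for a refill
    waited = time.monotonic() - start
    beat.cancel()
    return ticks, waited


ticks, waited = asyncio.run(demo())
print(f"heartbeat ticked {ticks} times during a {waited * 1000:.0f} ms wait")
```

The heartbeat keeps ticking while the second `acquire` waits, which is what you would expect if the wait yields to the event loop instead of holding it.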

Sizing It Right

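Pull your provider's documented limits, convert requests per minute to requests per second, and back off 20 percent (refill_rate = RPM / 60 × 0.8). The bucket counts requests, so RPM drives refill_rate; TPM only comes into play if you pass a token estimate as cost:

  • OpenAI Tier 2: 200K TPM, 500 RPM → capacity=20, refill_rate=6.7
  • Anthropic Build 1: 100K TPM, 50 RPM → capacity=10, refill_rate=0.67
  • OpenRouter Free: 20K TPM, 20 RPM → capacity=5, refill_rate=0.27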
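The back-off arithmetic can be sketched as a one-line helper. `refill_rate_from_rpm` is a hypothetical name, and the sketch assumes the bucket counts requests (one token per call), so only the RPM figure feeds the refill rate. The RPM values are the ones quoted in the list above; check them against your provider's current documentation, since published limits change.

```python
def refill_rate_from_rpm(rpm: float, safety_margin: float = 0.20) -> float:
    """Convert a requests-per-minute limit into tokens per second, minus a safety margin."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    if not 0 <= safety_margin < 1:
        raise ValueError("safety_margin must be in [0, 1)")
    return rpm / 60.0 * (1.0 - safety_margin)


for name, rpm in [("OpenAI Tier 2", 500), ("Anthropic Build 1", 50), ("OpenRouter Free", 20)]:
    print(f"{name}: refill_rate={refill_rate_from_rpm(rpm):.2f}")
# OpenAI Tier 2: refill_rate=6.67
# Anthropic Build 1: refill_rate=0.67
# OpenRouter Free: refill_rate=0.27
```

Capacity is left as a manual choice here because it depends on how much burst you are willing to pass through, not on a published number.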

Then load-test your endpoint with Locust for ten minutes and watch the 429 rate. If more than 1 percent of requests come back as 429s, drop refill_rate by 20 percent and rerun.
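One round of that tuning loop reduces to a small rule. In this sketch, `next_refill_rate` is a hypothetical helper that takes the request and 429 counts from a load-test report and returns the rate to use for the next run.

```python
def next_refill_rate(refill_rate: float, total_requests: int, throttled: int,
                     threshold: float = 0.01, step: float = 0.20) -> float:
    """Apply one round of the tuning rule to a load-test result."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if throttled / total_requests > threshold:
        # More than 1 percent of requests were 429s: back off by 20 percent.
        return refill_rate * (1.0 - step)
    return refill_rate  # under the threshold: keep the current rate


# 90 throttled out of 6000 is 1.5 percent, so the rate drops from 2.0 to about 1.6.
print(next_refill_rate(2.0, total_requests=6000, throttled=90))
```

Repeating the run until the rate stops changing gives you the setting to ship.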

When To Skip This

If you are making fewer than ten LLM calls per second across all users, you do not need this. The provider's limits will rarely fire. This is for the moment your service goes from personal project to shared service — usually right around when the first user reports a 429 and the second one tweets about it.

Drop the snippet into a module, import it in your route, and ship it before the next traffic spike, not after.


Token bucket rate limiting is one of those things that looks like premature optimization until the day it isn't. The day is always sooner than you think.
