
Hey guys, Mr. Technology here.
The single most expensive mistake in a fresh LLM backend is a blocking /chat endpoint. Your user types a prompt, hits send, then stares at a spinner for 8 seconds while GPT-4o generates. They click the button again. Twice. Your upstream tokens triple. You ship three identical responses and your bill goes through the roof. The fix is streaming with Server-Sent Events, and it takes 30 minutes. Most teams put it off for "later." The teams that ship it on day one save real money and ship a better product.
```bash
mkdir llm-stream && cd llm-stream
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn openai sse-starlette httpx
```
You will need sse-starlette — FastAPI does not ship a built-in SSE helper, and rolling your own with StreamingResponse is how you spend a weekend debugging chunked transfer encoding.
```python
import json
import os

from fastapi import FastAPI, Request
from openai import OpenAI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


@app.post("/chat/stream")
async def chat_stream(request: Request, body: dict):
    async def event_generator():
        try:
            stream = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=body["messages"],
                stream=True,
            )
            for chunk in stream:
                # Stop as soon as the client goes away.
                if await request.is_disconnected():
                    break
                delta = chunk.choices[0].delta.content or ""
                if delta:
                    yield {"event": "token", "data": json.dumps({"token": delta})}
            yield {"event": "done", "data": "[DONE]"}
        except Exception as e:
            # Report failures in-band so the client can show an error.
            yield {"event": "error", "data": json.dumps({"message": str(e)})}

    return EventSourceResponse(event_generator())
```
Run it:
```bash
uvicorn app:app --reload --port 8000
```
Test it from the terminal with curl:
```bash
curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi in 5 words"}]}'
```
You should see tokens arriving one by one, ~50ms apart, with event: token lines. That is SSE — plain HTTP, every proxy on the planet speaks it.
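If you are curious what those event: token lines look like on the wire, the framing is simple enough to sketch by hand. This is a minimal illustration of the SSE text format, not sse-starlette's actual code; format_sse is a hypothetical helper name. It uses \n line endings, though the format also allows \r\n.

```python
import json
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialize one Server-Sent Event frame as text (hypothetical helper)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload is sent as one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


frame = format_sse(json.dumps({"token": "Hi"}), event="token")
print(frame, end="")
```

Each event is a handful of "field: value" lines followed by a blank line, which is why curl can show it without any special tooling.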
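**Gotcha 1: The OpenAI client is sync, and that sync call blocks your event loop.** Every tutorial on the internet calls client.chat.completions.create(stream=True) inside an async def and pretends it streams. It does not: the underlying HTTP call blocks the worker until the first chunk arrives, and again while it waits for each later chunk, so every other request on that worker stalls and you get 1-3 seconds of "dead air" before tokens start flowing. Run the stream consumption in a thread with asyncio.to_thread() and hand tokens to an async generator through a queue. (The openai package also ships an AsyncOpenAI client that avoids the thread entirely; the pattern here is for when you are stuck with a sync one.) The fix is about 20 lines and drops your time-to-first-token by half: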
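```python
import asyncio


async def stream_openai(messages):
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def pump():
        # Runs in a worker thread, so the blocking client never touches the event loop.
        try:
            for chunk in client.chat.completions.create(
                model="gpt-4o-mini", messages=messages, stream=True
            ):
                d = chunk.choices[0].delta.content
                if d:
                    loop.call_soon_threadsafe(queue.put_nowait, d)
        finally:
            # None is the sentinel that tells the consumer the stream is over.
            loop.call_soon_threadsafe(queue.put_nowait, None)

    # Keep a reference so the task is not garbage-collected mid-stream.
    task = asyncio.create_task(asyncio.to_thread(pump))
    while True:
        tok = await queue.get()
        if tok is None:
            await task  # re-raise any exception from the worker thread
            return
        yield tok
```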
Thread feeds a queue; async generator drains it. No blocking, no buffering the whole response.
**Gotcha 2: You must check request.is_disconnected() on every chunk, or you pay for tokens nobody reads.** I learned this the hard way watching a bill double after a frontend bug caused users to navigate away mid-stream. The server kept generating because nothing told it to stop. That is_disconnected() check is the line that saves you thousands a month at modest scale. Pair it with a client-side AbortController so the browser actually closes the connection.
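One wrinkle if you combine this with the thread-and-queue pattern from Gotcha 1: breaking out of the async loop does not stop the worker thread, so the upstream stream keeps running. Here is a minimal sketch of one way to handle that, assuming a threading.Event as the stop flag. The names stream_with_cancel, make_stream, and fake_tokens are hypothetical, and fake_tokens stands in for the OpenAI stream so the example runs without an API key.

```python
import asyncio
import threading
import time
from typing import AsyncIterator, Callable, Iterable, List, Tuple


async def stream_with_cancel(
    make_stream: Callable[[], Iterable[str]],
) -> AsyncIterator[str]:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    stop = threading.Event()

    def pump() -> None:
        try:
            for token in make_stream():
                if stop.is_set():
                    break  # consumer is gone, stop pulling from upstream
                loop.call_soon_threadsafe(queue.put_nowait, token)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, None)

    task = asyncio.create_task(asyncio.to_thread(pump))
    try:
        while True:
            token = await queue.get()
            if token is None:
                return
            yield token
    finally:
        # Runs on normal completion and when the generator is closed early.
        stop.set()
        await task


def fake_tokens(produced: List[str]) -> Iterable[str]:
    # Stand-in for a blocking upstream stream (assumption, not the OpenAI API).
    for i in range(100):
        time.sleep(0.01)
        produced.append(f"t{i}")
        yield f"t{i}"


async def demo() -> Tuple[List[str], List[str]]:
    produced: List[str] = []
    received: List[str] = []
    gen = stream_with_cancel(lambda: fake_tokens(produced))
    async for token in gen:
        received.append(token)
        if len(received) == 3:
            break  # simulate request.is_disconnected() returning True
    await gen.aclose()  # triggers the finally block, which stops the thread
    return produced, received


produced, received = asyncio.run(demo())
print(len(received), "received;", len(produced), "generated upstream")
```

In the endpoint, the break would sit behind the request.is_disconnected() check. With a real client you would likely also want to close the upstream stream object when the flag is set, so the HTTP connection is released instead of just abandoned.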
**Gotcha 3: Proxies buffer SSE unless you set the right headers.** Nginx by default buffers proxied responses until roughly 4KB accumulates, so your user sees nothing for 3 seconds, then a wall of tokens. sse-starlette sets Cache-Control: no-cache, X-Accel-Buffering: no, and Content-Type: text/event-stream for you. X-Accel-Buffering is an Nginx convention, though, so Cloudflare and corporate proxies may ignore it and need buffering turned off in their own settings. Verify in response headers before you blame your code.
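To make the header check less of an eyeball exercise, here is a small sketch that flags the usual suspects. It assumes you already have the response headers as a dict (from curl -i output or an HTTP client); sse_header_problems is a hypothetical helper, and the rules are a starting point rather than a complete list.

```python
from typing import Dict, List


def sse_header_problems(headers: Dict[str, str]) -> List[str]:
    """Return human-readable warnings for headers that tend to break SSE."""
    # Header names are case-insensitive, so normalize before checking.
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    problems = []
    if not h.get("content-type", "").startswith("text/event-stream"):
        problems.append("Content-Type is not text/event-stream")
    cache = h.get("cache-control", "")
    if "no-cache" not in cache and "no-store" not in cache:
        problems.append("Cache-Control does not disable caching")
    if h.get("x-accel-buffering") != "no":
        problems.append("X-Accel-Buffering: no is missing (Nginx may buffer)")
    if "content-length" in h:
        # A true stream has no known length up front.
        problems.append("Content-Length is set, so something buffered the body")
    return problems


good = {
    "Content-Type": "text/event-stream; charset=utf-8",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}
print(sse_header_problems(good))
```

Run it against the headers you see in production, not just localhost, since the proxy in front of your app is usually what changes them.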
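```javascript
const res = await fetch('/chat/stream', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({messages: [{role: 'user', content: prompt}]}),
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const {done, value} = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries.
  buffer += decoder.decode(value, {stream: true});
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep the trailing partial line for the next read
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6).trim();
    if (data === '[DONE]') continue; // the done event is not JSON
    const token = JSON.parse(data).token;
    if (token) appendToChat(token);
  }
}
```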
Twenty lines of vanilla JS. No SSE library needed. Works in every browser since 2018.
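The same parsing logic is handy on the Python side for tests or backend-to-backend calls. The sketch below is a minimal parser, assuming you can get the response body as an iterable of byte chunks; iter_sse_events is a hypothetical name. It buffers across chunk boundaries, handles \n and \r\n line endings (not bare \r), and ignores fields other than event and data.

```python
import codecs
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(chunks: Iterable[bytes]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from raw byte chunks of an SSE stream."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        # Incremental decoding keeps split multi-byte characters intact.
        buffer = (buffer + decoder.decode(chunk)).replace("\r\n", "\n")
        # A blank line marks the end of one event.
        while "\n\n" in buffer:
            raw, buffer = buffer.split("\n\n", 1)
            event, data_lines = "message", []
            for line in raw.split("\n"):
                if line.startswith("event:"):
                    event = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    value = line[len("data:"):]
                    # The format strips a single leading space after the colon.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                yield event, "\n".join(data_lines)


# Worst case for a parser: the stream arrives one byte at a time.
wire = (
    'event: token\r\ndata: {"token": "h\u00e9"}\r\n\r\n'
    "event: done\r\ndata: [DONE]\r\n\r\n"
).encode("utf-8")
events = list(iter_sse_events(wire[i:i + 1] for i in range(len(wire))))
for event, data in events:
    if event == "token":
        print(json.loads(data)["token"])
```

Feeding it one byte at a time is a cheap way to confirm the parser does not depend on where the network happens to split the stream.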
Streaming is not a polish feature you ship in v2. It is table stakes. The 30-minute investment here saves you a support ticket per week per user, a meaningful chunk of your LLM bill from abandoned requests, and the slow embarrassment of "why is my chat app slower than ChatGPT." Build it on day one.
— Mr. Technology
*Packages: fastapi, uvicorn, openai, sse-starlette. SSE = Server-Sent Events, a one-way HTTP stream from server to browser over a single long-lived text/event-stream connection. Time-to-first-token on gpt-4o-mini typically drops from 1.5-3s (blocking) to 250-400ms (streaming). For Anthropic, swap the OpenAI client for anthropic, open client.messages.stream(...) as a context manager, and iterate over its text_stream. Always pair server-side with a client-side AbortController to actually close the connection on navigation.*