Your chat endpoint blocks for 8 seconds, the user clicks the button two more times, your upstream bill triples, and you ship three identical answers. Streaming with Server-Sent Events fixes all of it in 30 minutes. Here is the FastAPI build that actually works in production, with the three gotchas that are not in the docs.

Streaming LLM Responses with FastAPI and Server-Sent Events: The 30-Minute Build

Hey guys, Mr. Technology here.

The single most expensive mistake in a fresh LLM backend is a blocking /chat endpoint. Your user types a prompt, hits send, then stares at a spinner for 8 seconds while GPT-4o generates. They click the button again. Twice. Your upstream tokens triple. You ship three identical responses and your bill goes through the roof. The fix is streaming with Server-Sent Events, and it takes 30 minutes. Most teams put it off for "later." The teams that ship it on day one save real money and ship a better product.

The Setup

bash

mkdir llm-stream && cd llm-stream
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn openai sse-starlette httpx

You will need sse-starlette — FastAPI does not ship a built-in SSE helper, and rolling your own with StreamingResponse is how you spend a weekend debugging chunked transfer encoding.

The Endpoint

python

import json
import os

from fastapi import FastAPI, Request
from openai import OpenAI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


@app.post("/chat/stream")
async def chat_stream(request: Request, body: dict):
    async def event_generator():
        stream = None
        try:
            # This is the sync client; see Gotcha 1 for why that matters.
            stream = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=body["messages"],
                stream=True,
            )
            for chunk in stream:
                # Stop as soon as the client goes away (see Gotcha 2).
                if await request.is_disconnected():
                    break
                if not chunk.choices:  # some chunks carry no choices
                    continue
                delta = chunk.choices[0].delta.content or ""
                if delta:
                    yield {"event": "token", "data": json.dumps({"token": delta})}
            yield {"event": "done", "data": "[DONE]"}
        except Exception as e:
            yield {"event": "error", "data": json.dumps({"message": str(e)})}
        finally:
            if stream is not None:
                stream.close()  # release the upstream connection

    return EventSourceResponse(event_generator())

Save it as app.py and run it:

bash

uvicorn app:app --reload --port 8000

Test it from the terminal with curl:

bash

curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi in 5 words"}]}'

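You should see tokens arriving one by one, a few tens of milliseconds apart, each as an event: token line followed by a data: line. That is SSE: plain HTTP over one long-lived response, so it needs no WebSocket upgrade and passes through ordinary proxies (once buffering is off, see Gotcha 3).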
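To make the wire format concrete, here is a minimal sketch of what one frame looks like. The format_sse helper is hypothetical, written only for illustration; sse-starlette does this for you, and it may use a different line ending (the SSE format accepts \n, \r\n, or \r).

```python
import json


def format_sse(event: str, data: str) -> str:
    """Render one Server-Sent Events frame as it appears on the wire."""
    # Each line of the payload gets its own "data:" field;
    # the trailing blank line marks the end of the frame.
    data_lines = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"event: {event}\n{data_lines}\n"


frame = format_sse("token", json.dumps({"token": "Hi"}))
print(frame, end="")
```

The curl output above is just a sequence of these frames, one per token, followed by the done event.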

The Three Gotchas

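Gotcha 1: The OpenAI client in the endpoint above is sync, and a sync call blocks your event loop. Plenty of tutorials call client.chat.completions.create(stream=True) inside an async def and call it a day. It does stream, but every network read blocks the worker: while one response is waiting on its next chunk, every other request on that worker waits too, disconnect checks and other users' streams included. Under load that shows up as 1-3 seconds of "dead air" before tokens start flowing. The SDK also ships AsyncOpenAI, which is the cleanest fix. If you have to stay on the sync client, run the stream consumption in a thread with asyncio.to_thread() and hand tokens to an async generator through a queue. The fix is about 20 lines and keeps time-to-first-token from climbing as concurrent streams pile up: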

python

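import asyncio


async def stream_openai(messages):
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def pump():
        # Runs in a worker thread, so blocking reads stay off the event loop.
        try:
            for chunk in client.chat.completions.create(
                model="gpt-4o-mini", messages=messages, stream=True
            ):
                if not chunk.choices:
                    continue
                d = chunk.choices[0].delta.content
                if d:
                    loop.call_soon_threadsafe(queue.put_nowait, d)
        finally:
            # None is the sentinel that tells the consumer the stream ended.
            loop.call_soon_threadsafe(queue.put_nowait, None)

    # Keep a reference so the task is not garbage-collected mid-stream.
    task = asyncio.create_task(asyncio.to_thread(pump))
    while True:
        tok = await queue.get()
        if tok is None:
            break
        yield tok
    await task  # re-raises any exception from the worker thread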

Thread feeds a queue; async generator drains it. No blocking, no buffering the whole response.
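If you can switch clients, the thread and the queue go away entirely. Here is a minimal sketch, assuming the openai package's AsyncOpenAI client. The stream_tokens name and the client parameter are mine, chosen for illustration, and the function accepts anything with the same awaitable chat.completions.create interface.

```python
from typing import Any, AsyncIterator


async def stream_tokens(client: Any, messages: list[dict]) -> AsyncIterator[str]:
    """Yield text deltas from an async chat-completions stream.

    `client` is assumed to be an openai.AsyncOpenAI instance.
    """
    stream = await client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, stream=True
    )
    async for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

In the endpoint you would create the client once with AsyncOpenAI() and loop with async for tok in stream_tokens(client, body["messages"]). Every read is awaited, so other requests keep moving while this one waits on the network.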

Gotcha 2: You must check request.is_disconnected() on every chunk, or you pay for tokens nobody reads. I learned this the hard way watching a bill double after a frontend bug caused users to navigate away mid-stream. The server kept generating because nothing told it to stop. The check only saves money if stopping the loop also closes the upstream request, so make sure the OpenAI stream is closed when you stop reading. Pair it with a client-side AbortController so the browser actually closes the connection.
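One way to keep that check out of your business logic is a small wrapper around any token generator. This is a sketch under assumptions: stop_on_disconnect is a hypothetical helper, and whether the upstream request really stops depends on the source generator cleaning up after itself when it is closed.

```python
from typing import AsyncGenerator, Awaitable, Callable


async def stop_on_disconnect(
    tokens: AsyncGenerator[str, None],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncGenerator[str, None]:
    """Relay tokens until the client goes away, then shut the source down."""
    try:
        async for tok in tokens:
            if await is_disconnected():
                break
            yield tok
    finally:
        # Closing the source generator runs its cleanup (finally blocks,
        # context managers), which is where the upstream stream gets closed.
        await tokens.aclose()
```

In the endpoint this would look like stop_on_disconnect(stream_openai(body["messages"]), request.is_disconnected), passing the method itself rather than calling it.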

Gotcha 3: Proxies buffer SSE unless you set the right headers. Nginx buffers proxied responses by default (proxy_buffering is on), so your user sees nothing for 3 seconds, then a wall of tokens. sse-starlette sets Cache-Control: no-cache, X-Accel-Buffering: no, and Content-Type: text/event-stream for you. X-Accel-Buffering is an nginx convention, though: Cloudflare and corporate proxies may ignore it and need buffering turned off in their own configuration. Verify the response headers on the client side of the proxy before you blame your code.
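The header check is easy to script. Below is a rough sketch: sse_header_problems is a hypothetical helper, and the three rules are just the headers named above, not an exhaustive list of what a given proxy needs.

```python
from typing import Mapping


def sse_header_problems(headers: Mapping[str, str]) -> list[str]:
    """Return the header issues that commonly lead to buffered SSE."""
    # Header names are case-insensitive, so normalize before comparing.
    h = {k.lower(): v.lower() for k, v in headers.items()}
    problems = []
    if not h.get("content-type", "").startswith("text/event-stream"):
        problems.append("Content-Type is not text/event-stream")
    cache = h.get("cache-control", "")
    if "no-cache" not in cache and "no-store" not in cache:
        problems.append("Cache-Control does not disable caching")
    if h.get("x-accel-buffering") != "no":
        problems.append("X-Accel-Buffering: no is missing (nginx may buffer)")
    return problems


print(sse_header_problems({"Content-Type": "application/json"}))
```

Feed it the headers from curl -i or from an httpx response, once against uvicorn directly and once through the proxy. If the list is empty on the first run and not on the second, the proxy is rewriting your headers.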

The Frontend

javascript

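const controller = new AbortController(); // call controller.abort() to cancel the stream
const res = await fetch('/chat/stream', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({messages: [{role:'user', content: prompt}]}),
  signal: controller.signal,
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const {done, value} = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, {stream: true});
  const lines = buffer.split(/\r?\n/);
  buffer = lines.pop(); // keep the trailing partial line for the next read
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') continue; // end-of-stream marker, not JSON
    const token = JSON.parse(data).token;
    if (token) appendToChat(token);
  }
}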

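A couple dozen lines of vanilla JS. No SSE library needed, and since the browser's built-in EventSource only does GET, fetch with a stream reader is the way to POST a prompt. Works in every current browser.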
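The same parsing logic is handy on the Python side when you test the endpoint. A minimal sketch, assuming the stream arrives as arbitrary byte chunks; SSELineBuffer is a hypothetical name, and it only extracts data: payloads, ignoring event names and multi-line data.

```python
class SSELineBuffer:
    """Accumulate raw bytes and emit complete `data:` payloads."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[str]:
        self._buf += chunk
        # Everything before the last newline is complete; keep the remainder.
        *lines, self._buf = self._buf.split(b"\n")
        payloads = []
        for raw in lines:
            # Complete lines are safe to decode: a newline byte never
            # appears inside a multi-byte UTF-8 character.
            line = raw.rstrip(b"\r").decode("utf-8")
            if line.startswith("data: "):
                payloads.append(line[6:])
        return payloads
```

The point is the buffer: a network read can end mid-line or mid-character, so a parser that handles each chunk in isolation will eventually choke on a split frame.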

The Take

Streaming is not a polish feature you ship in v2. It is table stakes. The 30-minute investment here saves you a support ticket per week per user, a meaningful chunk of your LLM bill from abandoned requests, and the slow embarrassment of "why is my chat app slower than ChatGPT." Build it on day one.

— Mr. Technology

*Packages: fastapi, uvicorn, openai, sse-starlette. SSE = Server-Sent Events, a one-way HTTP stream from server to browser over a single long-lived text/event-stream connection. Time-to-first-token on gpt-4o-mini typically drops from 1.5-3s (blocking) to 250-400ms (streaming). For Anthropic, swap the OpenAI client for anthropic, open client.messages.stream(...) as a context manager, and iterate over its text_stream. Always pair server-side with a client-side AbortController to actually close the connection on navigation.*