
200ms median first-token latency, 9-second p99. Users blame the model. The model is fine. The streaming plumbing is buffering tokens it should be flushing. Four knobs fix it — no model server changes.
Hey guys, Mr. Technology here.
Three buffers sit between your model and the user, all defaulting to "wait until I have a chunk worth shipping":
1. Reverse proxy. NGINX, Envoy, and ALB buffer 4-16 KB before flushing.
2. Runtime group commit. asyncio and Node batch writes when multiple sockets are ready. A 5 ms batch × 200 connections adds up to 1 s of worst-case queuing at the tail.
3. KV scheduler. vLLM, TGI, and SGLang buffer 8-32 tokens before flushing.
Defeating (1) and (2) takes an afternoon. (3) needs a model server change.
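NGINX, the most common culprit:

    location /v1/stream {
        proxy_pass http://upstream;
        proxy_http_version 1.1;
        proxy_buffering off;   # pass upstream bytes through as they arrive
        proxy_cache off;
        # X-Accel-Buffering is a response header. NGINX consumes the copy sent
        # by the upstream, so re-add it for any proxy or CDN in front of this one.
        add_header X-Accel-Buffering no;
        chunked_transfer_encoding on;
    }

X-Accel-Buffering: no is the closest thing to a universal opt-out. It travels on the response, so emit it from your app even when you do not know what is in front.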
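One way to emit the header from the app regardless of framework is a small pure-ASGI middleware. The sketch below is an assumption about how you might wire it, not part of the setup described here: NoBufferingMiddleware and fake_sse_app are hypothetical names, and the middleware only touches responses whose Content-Type is text/event-stream.

```python
import asyncio


class NoBufferingMiddleware:
    """Pure-ASGI middleware (hypothetical name) that adds
    X-Accel-Buffering: no to every text/event-stream response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_header(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                names = {name.lower() for name, _ in headers}
                content_type = next(
                    (value for name, value in headers if name.lower() == b"content-type"),
                    b"",
                )
                is_sse = content_type.lower().startswith(b"text/event-stream")
                # Add the header only if the app did not already set it.
                if is_sse and b"x-accel-buffering" not in names:
                    headers.append((b"x-accel-buffering", b"no"))
                    message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_header)


async def fake_sse_app(scope, receive, send):
    """Stand-in ASGI app used only to exercise the middleware."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/event-stream; charset=utf-8")],
    })
    await send({"type": "http.response.body", "body": b"data: hi\n\n"})


async def demo():
    """Run one fake request through the middleware and return the response headers."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    await NoBufferingMiddleware(fake_sse_app)({"type": "http"}, receive, send)
    return dict(sent[0]["headers"])


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With FastAPI this would typically be registered with app.add_middleware(NoBufferingMiddleware), so every SSE route gets the header without repeating it per endpoint.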
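    import asyncio

    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse
    from pydantic import BaseModel

    app = FastAPI()

    # `model` stands in for your inference client; it is assumed to expose
    # an async generator `stream(prompt)` that yields one token at a time.

    class StreamRequest(BaseModel):
        prompt: str
        stream: bool = True

    async def stream_tokens(prompt: str):
        async for token in model.stream(prompt):
            yield f"data: {token}\n\n".encode("utf-8")
            # The crucial line: hand control back to the event loop after
            # every token instead of racing ahead to the next one.
            await asyncio.sleep(0)

    @app.post("/v1/stream")
    async def stream(req: StreamRequest):
        return StreamingResponse(
            stream_tokens(req.prompt),
            media_type="text/event-stream",
            headers={
                "Cache-Control": "no-cache",
                "X-Accel-Buffering": "no",
                "Connection": "keep-alive",
            },
        )

asyncio.sleep(0) returns control to the event loop after every token, so a fast generator cannot hog the loop while other connections' writes wait. Each token goes out as its own SSE event.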
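One caveat with the f-string framing: a token that contains a newline would end the SSE event early. A minimal sketch of a safer encoder, assuming plain-text tokens; sse_event is a hypothetical helper name, not part of FastAPI.

```python
import re

# SSE treats CRLF, CR, and LF as line terminators.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def sse_event(data: str) -> bytes:
    """Encode one payload as a single SSE event.

    Every line break inside the payload becomes its own `data:` line;
    the client joins those lines back together with a newline.
    """
    lines = _LINE_BREAK.split(data)
    body = "".join(f"data: {line}\n" for line in lines)
    return (body + "\n").encode("utf-8")  # blank line terminates the event


if __name__ == "__main__":
    print(sse_event("Hello"))
    print(sse_event("line one\nline two"))
```

In the generator, this would replace the f-string: yield sse_event(token). Another common choice is to JSON-encode each token, which also keeps the payload on one line.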
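The client has to stream too:

    from openai import OpenAI

    # The API key is read from the OPENAI_API_KEY environment variable.
    client = OpenAI(base_url="https://api.openai.com/v1")
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        stream=True,
        messages=[{"role": "user", "content": "Tell me a short story"}],
    )
    for chunk in stream:
        # Some chunks carry no choices (for example a trailing usage chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

stream=True makes the SDK hand you chunks as they arrive instead of collecting the whole response first. Keep flush=True on print when debugging, or stdout buffering will hide the effect.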
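A quick way to confirm that chunks really arrive incrementally is to timestamp them on the client. The sketch below wraps any iterator of chunks, whether an SDK stream or something else. timed_chunks and summarize are hypothetical helper names, and the clock parameter exists only so the logic can be checked without a network.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def timed_chunks(
    chunks: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[Tuple[float, object]]:
    """Yield (seconds_since_start, chunk) for every chunk of a stream."""
    start = clock()
    for chunk in chunks:
        yield clock() - start, chunk


def summarize(arrivals: List[float]) -> dict:
    """Report first-chunk delay, total time, and the largest gap between chunks."""
    if not arrivals:
        raise ValueError("stream produced no chunks")
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "first_chunk_s": arrivals[0],
        "total_s": arrivals[-1],
        "max_gap_s": max(gaps, default=0.0),
    }


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a real stream: one token every 10 ms.
        for token in ["Once", " upon", " a", " time"]:
            time.sleep(0.01)
            yield token

    arrivals = [elapsed for elapsed, _ in timed_chunks(fake_stream())]
    print(summarize(arrivals))
```

If first_chunk_s is close to total_s, everything arrived in one burst and something upstream is still buffering.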
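Idle SSE connections get dropped after roughly 60 s by corporate proxies, load balancers, and NAT gateways (60 s is the default for both NGINX's proxy_read_timeout and the AWS ALB idle timeout). A heartbeat keeps the socket warm without firing user-visible events:

    async def stream_tokens(prompt: str):
        yield b": ping\n\n"  # SSE comment, ignored by clients
        tokens = model.stream(prompt).__aiter__()
        pending = asyncio.ensure_future(tokens.__anext__())
        try:
            while True:
                # Wait up to 15 s for the next token without cancelling the fetch.
                done, _ = await asyncio.wait({pending}, timeout=15)
                if not done:
                    yield b": ping\n\n"  # model is quiet, keep the socket warm
                    continue
                try:
                    token = pending.result()
                except StopAsyncIteration:
                    break
                yield f"data: {token}\n\n".encode()
                pending = asyncio.ensure_future(tokens.__anext__())
                await asyncio.sleep(0)
        finally:
            pending.cancel()  # no-op if finished; stops the fetch on disconnect

The leading : marks an SSE comment line. Clients ignore it; the proxy sees traffic and keeps the socket open. The timeout on asyncio.wait means the ping still goes out when the model stalls between tokens, which is exactly when the connection is at risk.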
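To see why the ping is invisible to users, here is a minimal sketch of the client-side parsing rule, simplified from the SSE format: lines that start with a colon are dropped, and only data fields produce events. parse_sse is a hypothetical helper, and it skips the event, id, and retry fields for brevity.

```python
import re
from typing import List

# SSE treats CRLF, CR, and LF as line terminators.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def parse_sse(raw: bytes) -> List[str]:
    """Return the data payload of every event in a complete SSE byte stream."""
    events: List[str] = []
    data_lines: List[str] = []
    for line in _LINE_BREAK.split(raw.decode("utf-8")):
        if line == "":
            # A blank line dispatches the event collected so far, if any.
            if data_lines:
                events.append("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, for example the heartbeat ping
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is part of the framing
            if field == "data":
                data_lines.append(value)
    return events


if __name__ == "__main__":
    wire = b": ping\n\ndata: Hello\n\n: ping\n\ndata: world\n\n"
    print(parse_sse(wire))
```

Both pings vanish during parsing, so the user-facing code only ever sees the two data events.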
Load-test before and after:

    hey -n 1000 -c 50 -t 30 -m POST \
      -T "application/json" \
      -d '{"prompt":"hi","stream":true}' \
      http://localhost:8000/v1/stream

Pre-fix, p99 lands at 20-50x the median on badly configured stacks. Post-fix, expect p99 within 3-5x of the median. The win: two headers and one asyncio.sleep(0).
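Whole-request latency can hide first-token latency, so it can help to time the first data event directly. Below is a standard-library sketch, assuming the endpoint and JSON payload used above and an SSE response. measure_ttft, percentile, run_benchmark, and the concurrency settings are illustrative choices, not a replacement for a real load tester.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import List

URL = "http://localhost:8000/v1/stream"  # endpoint from the load test above
PAYLOAD = {"prompt": "hi", "stream": True}


def measure_ttft(url: str = URL, payload: dict = PAYLOAD, timeout: float = 30.0) -> float:
    """Return seconds from sending the request to the first `data:` line."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for line in response:  # iterate over the body line by line
            if line.startswith(b"data:"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any data event")


def percentile(samples: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def run_benchmark(requests: int = 1000, concurrency: int = 50) -> None:
    """Fire requests from a thread pool and print p50, p99, and their ratio."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(lambda _: measure_ttft(), range(requests)))
    p50, p99 = percentile(samples, 50), percentile(samples, 99)
    print(f"p50={p50 * 1000:.0f} ms  p99={p99 * 1000:.0f} ms  ratio={p99 / p50:.1f}x")
```

Call run_benchmark() against a running server before and after each knob; the ratio is the number to watch.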
Streaming LLM tail latency lives in the plumbing, not the model. Disabling proxy buffering, flushing per token, keeping SSE warm, and using the streaming client API drops p99 by 5-10x on a stock endpoint. If users complain the model is slow, check the buffers first. The model is the easiest thing to blame and the least likely to be the bottleneck.
— Mr. Technology
*Tested July 2026 on FastAPI 0.115+, NGINX 1.27+, openai-python 1.50+. Run all four knobs — they stack. SSE heartbeats should be 15-20s; shorter wastes bandwidth, longer risks proxy drops. X-Accel-Buffering: no is honored by NGINX, Cloudflare, and Fastly; verify Akamai and Azure CDN edge behaviors before deploying there.*