2026-07-01

Cut Streaming LLM Tail Latency with SSE: The 4 Knobs That Move p99

Your streaming LLM endpoint has 200ms median first-token latency and a 9-second p99. The model is fine — your streaming plumbing is buffering tokens it should be flushing. Four knobs fix it without touching the model server.

Hey guys, Mr. Technology here.

Where the Latency Hides

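Three buffers can sit between your model and the user, each defaulting to "wait until I have a chunk worth shipping":

1. Reverse proxy. NGINX buffers proxied responses by default, and Envoy filters or an ALB can add their own; expect 4-16 KB to be held before anything is flushed.
2. Runtime group commit. asyncio and Node service ready sockets in turn, so a small per-write delay compounds under load. In the worst case, 5 ms × 200 connections = 1 s of added delay for the last socket in line.
3. KV scheduler. vLLM, TGI, and SGLang can hold 8-32 tokens before emitting them, depending on version and configuration.

Defeating (1) and (2) takes an afternoon. (3) needs a model server change.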
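The group-commit arithmetic can be checked with a tiny sketch. It assumes a deliberately simplified model in which ready sockets are written strictly one after another; the function name is hypothetical and real event loops are less tidy than this.

```python
def worst_case_wait_ms(per_write_delay_ms: float, ready_connections: int) -> float:
    """Delay seen by the last socket if writes are serviced strictly in turn."""
    return per_write_delay_ms * ready_connections


# 5 ms per write, 200 connections ready at once
print(worst_case_wait_ms(5, 200) / 1000, "seconds")
```

Under that assumption the last connection in line waits a full second per token, which is why the tail grows with concurrency while the median barely moves.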

The Recipe

Knob 1 — Disable Proxy Buffering for SSE

NGINX, the most common culprit:

nginx
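location /v1/stream {
    proxy_pass http://upstream;
    proxy_http_version 1.1;
    # Hand upstream bytes to the client as soon as they arrive.
    proxy_buffering off;
    proxy_cache off;
    # X-Accel-Buffering is a response header, so it is not sent upstream.
    # This copy is for any other proxy sitting in front of this one.
    add_header X-Accel-Buffering no;
    chunked_transfer_encoding on;  # already the default
}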

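X-Accel-Buffering: no is the closest thing to a universal opt-out. NGINX reads it from the upstream response and turns buffering off for that response, and some other proxies and CDNs follow the same convention. Emit it from your app even when you do not know what sits in front.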
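A quick way to confirm the app is emitting the right headers is a small checker like the sketch below. The function name and the exact set of checks are assumptions, not part of any library. Run it against the app's own response rather than through the proxy, since NGINX hides X-Accel-* upstream headers from clients by default.

```python
from typing import List, Mapping


def sse_header_problems(headers: Mapping[str, str]) -> List[str]:
    """Return warnings for response headers that suggest the stream is buffered."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    problems = []
    if not h.get("content-type", "").startswith("text/event-stream"):
        problems.append("Content-Type is not text/event-stream")
    if h.get("x-accel-buffering") != "no":
        problems.append("X-Accel-Buffering: no is missing")
    if "content-length" in h:
        # A known length means something collected the whole body first.
        problems.append("Content-Length is set, so the body was likely buffered whole")
    return problems


print(sse_header_problems({"Content-Type": "application/json", "Content-Length": "12"}))
```

An empty list means the headers look right; it does not prove that nothing downstream is buffering.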

Knob 2 — Flush Immediately From Python

python
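import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class StreamRequest(BaseModel):
    # Matches a JSON body such as {"prompt": "hi", "stream": true}.
    prompt: str
    stream: bool = True

async def stream_tokens(prompt: str):
    # `model` stands in for your inference client; it is not defined here.
    async for token in model.stream(prompt):
        # Each yielded chunk is sent as its own body message: one SSE event per token.
        yield f"data: {token}\n\n".encode("utf-8")
        # Not a flush: this only yields to the event loop so other connections get a turn.
        await asyncio.sleep(0)

@app.post("/v1/stream")
async def stream(req: StreamRequest):
    return StreamingResponse(
        stream_tokens(req.prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
            "Connection": "keep-alive",
        },
    )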

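StreamingResponse sends each yielded chunk as its own body message, so every token leaves the app as a separate SSE event. asyncio.sleep(0) does not flush anything by itself; it returns control to the event loop so that other connections are serviced between tokens instead of waiting behind one busy generator.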
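One detail the per-token framing has to get right: a token that contains a newline would end the SSE field early. A minimal sketch of a framing helper, assuming a hypothetical name `format_sse`, splits such tokens into one data: line per text line, which is how the SSE wire format carries multi-line payloads.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> bytes:
    """Frame one SSE event, emitting a separate data: line per line of text."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Normalise line endings first so a stray \r cannot split a field.
    for part in data.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return ("\n".join(lines) + "\n\n").encode("utf-8")


print(format_sse("Hello"))
print(format_sse("line one\nline two"))
```

Inside the generator this would replace the f-string, for example `yield format_sse(token)`. Clients rejoin the data: lines with a newline, so the token arrives intact.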

Knob 3 — Switch the OpenAI Client to Streaming Mode

python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI(base_url="https://api.openai.com/v1")
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "Tell me a short story"}],
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

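stream=True asks the API to send tokens as server-sent events while they are generated; without it, the call returns only once the whole completion is ready. Pass flush=True to print so that stdout buffering does not hide the streaming while you debug.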

Knob 4 — Add a 15s SSE Heartbeat

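Idle SSE connections get dropped by corporate proxies, load balancers, and NAT gateways, often after about 60 s (the default for both NGINX proxy_read_timeout and the AWS ALB idle timeout). A heartbeat keeps the socket warm without firing user-visible events, provided it also fires while the model is silent: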

python
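import asyncio

async def stream_tokens(prompt: str):
    # `model` stands in for your inference client; it is not defined here.
    tokens = model.stream(prompt).__aiter__()
    yield b": ping\n\n"   # SSE comment, ignored by clients
    pending = asyncio.ensure_future(tokens.__anext__())
    try:
        while True:
            # Wait for the next token, but wake up after 15 s of silence.
            done, _ = await asyncio.wait({pending}, timeout=15)
            if not done:
                yield b": ping\n\n"   # model is idle: keep the socket warm
                continue
            try:
                token = pending.result()
            except StopAsyncIteration:
                break
            yield f"data: {token}\n\n".encode()
            pending = asyncio.ensure_future(tokens.__anext__())
    finally:
        pending.cancel()   # stop waiting if the client disconnects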

The leading : is an SSE comment line. Clients ignore it; the proxy keeps the socket open.
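To see why the heartbeat is invisible to users, here is a minimal sketch of the client side. `parse_sse` is a hypothetical helper that handles only data: fields and comments, not the full SSE field set (event, id, retry).

```python
from typing import List


def parse_sse(raw: bytes) -> List[str]:
    """Return the data payload of each event, skipping comment-only blocks."""
    events = []
    # Events are separated by a blank line.
    for block in raw.decode("utf-8").split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith(":"):
                continue  # comment line, e.g. the ": ping" heartbeat
            if line.startswith("data:"):
                value = line[len("data:"):]
                # A single leading space after the colon is not part of the value.
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if data_lines:
            events.append("\n".join(data_lines))
    return events


wire = b": ping\n\ndata: Hello\n\n: ping\n\ndata: world\n\n"
print(parse_sse(wire))
```

The two ping blocks produce no events, so only the token payloads reach application code.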

Measure Before You Trust It

bash
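# -t is the per-request timeout in seconds (-T would set the Content-Type).
# hey reports time to the full response, not time to first token.
hey -n 1000 -c 50 -t 30 -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt":"hi","stream":true}' \
  http://localhost:8000/v1/stream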

Pre-fix, p99 can land at 20-50x the median on badly configured stacks. Post-fix, expect p99 within 3-5x. The change set is small: two headers and a few lines of Python.
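Because a load tool that times whole responses cannot see first-token latency, a small timing helper is a useful complement. The sketch below uses hypothetical helper names and nearest-rank percentiles: it times any iterable of chunks (an HTTP response iterator or an SDK stream) and turns a list of samples into the p99-to-median ratio discussed above.

```python
import math
import statistics
import time
from typing import Callable, Iterable, Sequence, Tuple


def time_chunks(chunks: Iterable, clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float]:
    """Return (time to first chunk, largest gap between chunks) in seconds."""
    start = clock()
    previous = start
    first = None
    largest_gap = 0.0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now - start
        else:
            largest_gap = max(largest_gap, now - previous)
        previous = now
    if first is None:
        raise ValueError("stream produced no chunks")
    return first, largest_gap


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(samples: Sequence[float]) -> float:
    """How many times larger p99 is than the median."""
    return percentile(samples, 99) / statistics.median(samples)


# 98 requests at 200 ms and 2 at 9000 ms, in milliseconds
print(tail_ratio([200] * 98 + [9000] * 2))
```

Collect one time-to-first-chunk sample per request, then compare `tail_ratio` before and after each knob. A large gap between chunks with a normal first chunk usually points at a buffer downstream of the app.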

The Take

Streaming LLM tail latency lives in the plumbing, not the model. Disabling proxy buffering, flushing per token, keeping SSE warm, and using the streaming client API drops p99 by 5-10x on a stock endpoint. If users complain the model is slow, check the buffers first. The model is the easiest thing to blame and the least likely to be the bottleneck.

Mr. Technology


*Tested July 2026 on FastAPI 0.115+, NGINX 1.27+, openai-python 1.50+. Run all four knobs — they stack. SSE heartbeats should be 15-20s; shorter wastes bandwidth, longer risks proxy drops. X-Accel-Buffering: no is honored by NGINX, Cloudflare, and Fastly; verify Akamai and Azure CDN edge behaviors before deploying there.*
