
Your self-hosted Qwen2.5-32B is doing 80 tokens/sec single-stream. Fine for one user, embarrassing for ten. Speculative decoding gets you ~1.8× throughput on chat workloads for the cost of one tiny draft model and three CLI flags. Most setups break silently, and the docs don't tell you why.
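Prerequisites: vLLM 0.6+, a host with CUDA 12.1+, and a main model whose tokenizer is published on Hugging Face (Qwen, Llama, Mistral all qualify). The flags below are the ones from the 0.6 line; newer vLLM releases fold them into a single --speculative-config JSON argument, so check vllm serve --help for your version.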
Pull a small draft model that shares the exact tokenizer with your main model. The tokenizer match is non-negotiable — vLLM does not check it, and a mismatch yields plausible-looking nonsense that you will blame on prompt formatting for an hour.
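Because a mismatch produces no error, it can be worth comparing the two tokenizers before launching the server. What follows is a minimal sketch, not something vLLM runs for you. It assumes the Hugging Face transformers package is installed, and the helper names vocab_diff, tokenizers_match and load_vocab are made up for illustration. It only compares the token-to-id maps, which is a necessary condition for a match rather than proof of byte-for-byte equivalence.

```python
def vocab_diff(main_vocab: dict, draft_vocab: dict) -> dict:
    """Compare two token -> id maps and report where they disagree."""
    shared = main_vocab.keys() & draft_vocab.keys()
    return {
        "only_in_main": sorted(main_vocab.keys() - draft_vocab.keys()),
        "only_in_draft": sorted(draft_vocab.keys() - main_vocab.keys()),
        "id_conflicts": sorted(t for t in shared if main_vocab[t] != draft_vocab[t]),
    }


def tokenizers_match(main_vocab: dict, draft_vocab: dict) -> bool:
    """True only when both maps hold the same tokens with the same ids."""
    return not any(vocab_diff(main_vocab, draft_vocab).values())


def load_vocab(repo_id: str) -> dict:
    """Download a tokenizer from the Hugging Face Hub and return its vocab."""
    from transformers import AutoTokenizer  # lazy import; needs network access

    return AutoTokenizer.from_pretrained(repo_id).get_vocab()


# Usage against the pairing from this post (requires network access):
#   main = load_vocab("Qwen/Qwen2.5-32B-Instruct-AWQ")
#   draft = load_vocab("Qwen/Qwen2.5-0.5B-Instruct")
#   print(tokenizers_match(main, draft), vocab_diff(main, draft))
```

If the check fails, the three lists in the report show whether the problem is extra tokens on one side or the same token mapped to different ids.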
    vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
      --port 8000 \
      --quantization awq \
      --gpu-memory-utilization 0.92 \
      --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
      --num-speculative-tokens 5 \
      --speculative-disable-by-batch-size 16
The three new flags do everything:
- --speculative-model: the draft. Same tokenizer family as the main model.
- --num-speculative-tokens 5: draft 5 tokens ahead per step. 4–8 is the sweet spot; below 3 the overhead dominates, above 8 acceptance rate collapses.
- --speculative-disable-by-batch-size 16: when 16+ requests queue, fall back to normal decoding. Speculation loses at high concurrency because draft-verification work exceeds the savings.

On an A100-80GB serving Qwen2.5-32B-AWQ, the same 200-prompt eval moves from 18.4 tok/s/user (no spec) to 32.1 tok/s/user at concurrency 4. At concurrency 16 the two are statistically identical, because the auto-disable kicks in.
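The tradeoff behind --num-speculative-tokens can be sketched with a common back-of-envelope model. This is a toy estimate, not how vLLM schedules work: it assumes every drafted token is accepted independently with probability alpha, and that one draft forward pass costs a fraction cost_ratio of a main-model pass. Both inputs are assumptions you would have to measure for your own model pair, and the values 0.7 and 0.05 below are placeholders, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha. The main model always contributes one token of its own, so the
    result is 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Tokens per step divided by the step cost, measured in main-model passes."""
    step_cost = k * cost_ratio + 1.0  # k draft passes plus one verification pass
    return expected_tokens_per_step(alpha, k) / step_cost


# Placeholder inputs: 70% acceptance, draft pass costing 5% of a main pass.
for k in (2, 5, 8, 12):
    print(k, round(estimated_speedup(0.7, k, 0.05), 2))
```

Under these assumptions the estimate rises, peaks, and then falls as k grows, because each extra drafted token costs the same but is less and less likely to survive verification. Where the peak lands depends entirely on the acceptance rate and draft cost you measure.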
Tokenizer mismatch is silent. Pick a draft model from a different family, vLLM loads fine, the API works, the output is gibberish, you blame the prompt. The cheap fix: same family. Qwen/Qwen2.5-0.5B-Instruct for Qwen2.5-32B. Llama-3.2-1B-Instruct for Llama-3.1-70B. Cross-family works only if the tokenizers are byte-for-byte equivalent, which almost never holds.
TTFT gets worse. First-token latency rises 30–80 ms because the draft has to prefill before the main model verifies. For chat UIs this is invisible. For agents doing one-token tool calls in tight loops, it is death — disable speculation when serving tool-heavy traffic.
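To see the TTFT cost on your own hardware, one option is to time the first streamed chunk with and without the speculative flags. The sketch below is one way to do that, under assumptions: it uses the openai Python client's streaming mode against the server started above, and the helper names first_token_latency and stream_text are made up for illustration.

```python
import asyncio
import time


async def first_token_latency(pieces):
    """Return (seconds until the first non-empty piece, count of non-empty pieces).

    `pieces` is any async iterator of text fragments. The clock starts just
    before the first item is requested, which for a lazy generator such as
    stream_text below is also just before the HTTP request is sent.
    """
    t0 = time.perf_counter()
    ttft = None
    count = 0
    async for piece in pieces:
        if not piece:
            continue  # skip role-only or empty chunks
        if ttft is None:
            ttft = time.perf_counter() - t0
        count += 1
    return ttft, count


async def stream_text(client, model: str, prompt: str, max_tokens: int = 64):
    """Yield text fragments from a streaming chat completion."""
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices:
            yield chunk.choices[0].delta.content or ""


# Usage (requires the server from this post to be running):
#   from openai import AsyncOpenAI
#   client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")
#   ttft, n = asyncio.run(first_token_latency(
#       stream_text(client, "Qwen/Qwen2.5-32B-Instruct-AWQ", "Say hi.")))
```

Run it several times per configuration and compare medians; a single request is too noisy to show a difference of a few tens of milliseconds.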
Draft size is a tradeoff. Too small (≤0.3B) and acceptance rate collapses. Too large (≥3B) and draft-prefill cost eats the savings. The sweet spot is roughly 1/30 to 1/60 the size of the main model. The 0.5B-for-32B pairing is the most reliable I have measured.
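The sizing rule is plain arithmetic, and a quick check can help when picking a draft for a different main model. This sketch simply encodes the 1/30 to 1/60 rule of thumb from the paragraph above; the function names are made up for illustration and sizes are in billions of parameters.

```python
def draft_size_range(main_params_b: float,
                     max_divisor: float = 60.0,
                     min_divisor: float = 30.0) -> tuple:
    """Return (smallest, largest) suggested draft size in billions of parameters."""
    return main_params_b / max_divisor, main_params_b / min_divisor


def size_ratio(main_params_b: float, draft_params_b: float) -> float:
    """How many times larger the main model is than the draft."""
    return main_params_b / draft_params_b


low, high = draft_size_range(32)
print(f"32B main: draft between {low:.2f}B and {high:.2f}B")
print(f"0.5B draft is 1/{size_ratio(32, 0.5):.0f} of the main model")
```

For a 32B main model the rule gives roughly 0.53B to 1.07B, so the 0.5B draft sits at about 1/64, right at the small end of the range.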
Run a fixed workload at concurrency 1, 4, 8, 16 with and without speculation. Plot tok/s/user and TTFT. The win only matters where the line moves up — usually concurrency 2–8 for chat, 1–2 for code completion.
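Whatever harness produces the numbers, the comparison itself is a ratio per concurrency level. The sketch below is a minimal way to tabulate it; the function names and the 10% threshold in worthwhile are assumptions for illustration, and the only data used is the concurrency-4 measurement quoted earlier in this post.

```python
def speedup_by_concurrency(baseline: dict, spec: dict) -> dict:
    """Map concurrency -> (spec tok/s/user) / (baseline tok/s/user).

    Only concurrency levels present in both runs are compared.
    """
    return {c: spec[c] / baseline[c] for c in sorted(baseline.keys() & spec.keys())}


def worthwhile(ratios: dict, min_gain: float = 1.10) -> list:
    """Concurrency levels where speculation beats baseline by at least min_gain."""
    return [c for c, r in ratios.items() if r >= min_gain]


# tok/s/user keyed by concurrency; extend with your own runs at 1, 8, 16.
baseline = {4: 18.4}
spec = {4: 32.1}

ratios = speedup_by_concurrency(baseline, spec)
print({c: round(r, 2) for c, r in ratios.items()})
print("speculation pays off at concurrency:", worthwhile(ratios))
```

Levels that do not clear the threshold are candidates for lowering --speculative-disable-by-batch-size so that speculation switches off earlier.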
    import asyncio
    import time

    from openai import AsyncOpenAI


    async def bench(prompt: str, n: int = 20, conc: int = 4):
        client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")
        sem = asyncio.Semaphore(conc)  # cap in-flight requests at `conc`

        async def one() -> float:
            async with sem:
                t0 = time.perf_counter()
                r = await client.chat.completions.create(
                    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=200,
                )
                # Seconds per generated token. Use the server-reported token
                # count, not the character length of the reply.
                return (time.perf_counter() - t0) / max(r.usage.completion_tokens, 1)

        t = time.perf_counter()
        rates = await asyncio.gather(*[one() for _ in range(n)])
        print(f"conc={conc} tok/s/user={1 / (sum(rates) / len(rates)):.1f}")
        print(f"  total wall={time.perf_counter() - t:.1f}s")


    if __name__ == "__main__":
        # Example prompt; swap in your own fixed workload.
        asyncio.run(bench("Explain speculative decoding in two paragraphs.", conc=4))

vLLM + speculative decoding pays off when you are serving chat at concurrency 2–10 with a >7B main model. Skip it for tool-calling agents (TTFT cost), for ≤3B main models (overhead dominates), and when you are already GPU-saturated, because the draft prefill needs headroom to fit.
— Mr. Technology
*vLLM 0.6+ speculative decoding supports the --speculative-model family for n-gram and draft-model modes; EAGLE and Medusa are separate paths. Tested July 2026 on Qwen2.5-32B-Instruct-AWQ + Qwen2.5-0.5B-Instruct on a single A100-80GB. Same-family pairing is the rule; cross-family only works if the tokenizers are byte-equivalent, which they almost never are. Disable for any tool-calling-heavy workload — TTFT regression compounds per turn.*