2026-07-02

Make vLLM 1.8× Faster With Speculative Decoding (and a 0.5B Draft Model)

Your self-hosted 32B is doing 80 tok/s single-stream. Speculative decoding gets you ~1.8× throughput on chat for the cost of one tiny draft model and three CLI flags — with a tokenizer-mismatch footgun that turns output into gibberish without a single warning.

Your self-hosted Qwen2.5-32B is doing 80 tokens/sec single-stream. Fine for one user, embarrassing for ten. Speculative decoding gets you ~1.8× throughput on chat workloads for the cost of one tiny draft model and three CLI flags. The common failure mode is silent: pair the main model with a draft that uses a different tokenizer and the output degrades without any warning, and the docs don't tell you why.

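Prerequisites: vLLM 0.6+, a host with CUDA 12.1+, and a main model whose tokenizer is published on Hugging Face (Qwen, Llama, and Mistral all qualify). The flag names below are the 0.6-series spellings; newer vLLM releases group these options under a single --speculative-config JSON argument, so check vllm serve --help on your version.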

The Setup

Pull a small draft model that shares the exact tokenizer with your main model. The tokenizer match is non-negotiable — vLLM does not check it, and a mismatch yields plausible-looking nonsense that you will blame on prompt formatting for an hour.

bash
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
  --port 8000 \
  --quantization awq \
  --gpu-memory-utilization 0.92 \
  --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-disable-by-batch-size 16

The three new flags do everything:

  • --speculative-model: the draft. It must share the main model's tokenizer exactly; staying within the same model family is the easy way to get that.
  • --num-speculative-tokens 5: draft 5 tokens ahead per step. 4–8 is the sweet spot; below 3 the overhead dominates, above 8 the acceptance rate collapses.
  • --speculative-disable-by-batch-size 16: when 16+ requests queue, fall back to normal decoding. Speculation loses at high concurrency because draft-verification work exceeds the savings.
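
The trade-off behind --num-speculative-tokens can be sketched with the standard expected-acceptance formula from the speculative decoding literature. This is an idealized model, not a measurement from this setup: it assumes every draft token is accepted independently with probability alpha, and that one draft pass costs cost_ratio of a main-model pass. Both names and the example values are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k draft tokens.

    Assumes each draft token is accepted independently with probability
    alpha (an idealization; real acceptance varies by position and prompt).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # all k drafts accepted, plus one bonus token
    # Closed form of the geometric series 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    cost_ratio is the assumed cost of one draft pass relative to one
    main-model pass; each step pays k draft passes plus one verification.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # alpha=0.7 and cost_ratio=0.05 are made-up inputs for illustration.
    for k in (1, 3, 5, 8, 12):
        print(k, round(estimated_speedup(alpha=0.7, k=k, cost_ratio=0.05), 2))
```

Under these assumptions the gain flattens as k grows: each extra draft token is less likely to survive verification but still costs a draft pass.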

On an A100-80GB serving Qwen2.5-32B-AWQ, a fixed 200-prompt eval moves from 18.4 tok/s/user without speculation to 32.1 tok/s/user with it at concurrency 4 (about 1.7×). At concurrency 16 the two are statistically identical, because the auto-disable kicks in.

The Footguns

Tokenizer mismatch is silent. Pick a draft model from a different family, vLLM loads fine, the API works, the output is gibberish, you blame the prompt. The cheap fix: same family. Qwen/Qwen2.5-0.5B-Instruct for Qwen2.5-32B. Llama-3.2-1B-Instruct for Llama-3.1-70B. Cross-family works only if the tokenizers are byte-for-byte equivalent, which almost never holds.
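
Since nothing warns you about a mismatch, it may be worth comparing the two tokenizers yourself before launching the server. The sketch below is one way to do that, not a vLLM feature: tokenizer_mismatches, load_and_check, and the probe strings are hypothetical names and values chosen for illustration. It relies only on the get_vocab() and encode() methods that Hugging Face tokenizers expose.

```python
from typing import Dict, Iterable, List, Protocol


class TokenizerLike(Protocol):
    def get_vocab(self) -> Dict[str, int]: ...
    def encode(self, text: str) -> List[int]: ...


# Illustrative probes: plain text, code with whitespace, and non-ASCII.
PROBES = (
    "Hello, world!",
    "def f(x):\n    return x ** 2",
    "naïve café, 日本語, emoji 🙂",
)


def tokenizer_mismatches(
    main: TokenizerLike,
    draft: TokenizerLike,
    probes: Iterable[str] = PROBES,
) -> List[str]:
    """Return human-readable differences; an empty list means no mismatch found."""
    problems = []
    main_vocab, draft_vocab = main.get_vocab(), draft.get_vocab()
    if main_vocab != draft_vocab:
        differing = set(main_vocab.items()) ^ set(draft_vocab.items())
        problems.append(f"vocab differs in {len(differing)} token/id pairs")
    for text in probes:
        if main.encode(text) != draft.encode(text):
            problems.append(f"different token ids for {text!r}")
    return problems


def load_and_check(main_id: str, draft_id: str) -> List[str]:
    # Imported lazily so the comparison above stays dependency-free.
    from transformers import AutoTokenizer

    return tokenizer_mismatches(
        AutoTokenizer.from_pretrained(main_id),
        AutoTokenizer.from_pretrained(draft_id),
    )
```

Calling load_and_check("Qwen/Qwen2.5-32B-Instruct-AWQ", "Qwen/Qwen2.5-0.5B-Instruct") and refusing to start the server on a non-empty result is one way to use it. This only compares tokenizers; a model's config may pad vocab_size differently, which would need a separate check.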

TTFT gets worse. First-token latency rises 30–80 ms because the draft has to prefill before the main model verifies. For chat UIs this is invisible. For agents doing one-token tool calls in tight loops, it is death — disable speculation when serving tool-heavy traffic.
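
To see this regression on your own deployment, you can time the first streamed token with and without speculation. The following is a minimal sketch, assuming the OpenAI-compatible streaming endpoint that vLLM serves; measure_ttft and main are hypothetical helper names, and the prompt and max_tokens values are placeholders.

```python
import asyncio
import time
from typing import Any, AsyncIterator, Awaitable, Callable, Optional, Tuple


async def measure_ttft(
    open_stream: Callable[[], Awaitable[AsyncIterator[Any]]],
) -> Tuple[Optional[float], float]:
    """Return (seconds to first content token, total seconds) for one request.

    open_stream is a zero-argument callable that sends the request and
    resolves to an async iterator of OpenAI-style streaming chunks.
    """
    start = time.perf_counter()  # clock starts before the request is sent
    stream = await open_stream()
    ttft = None
    async for chunk in stream:
        # Role-only or usage-only chunks carry no text; skip them for TTFT.
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
    return ttft, time.perf_counter() - start


async def main() -> None:
    # Imported here so measure_ttft itself has no third-party dependency.
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")
    ttft, total = await measure_ttft(
        lambda: client.chat.completions.create(
            model="Qwen/Qwen2.5-32B-Instruct-AWQ",
            messages=[{"role": "user", "content": "Say hello."}],
            max_tokens=32,
            stream=True,
        )
    )
    print(f"ttft={ttft}  total={total:.3f}s")
```

Run asyncio.run(main()) against the same prompt once with the speculative flags and once without, and compare the ttft values over several repetitions rather than a single call.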

Draft size is a tradeoff. Too small (≤0.3B) and the acceptance rate collapses. Too large (≥3B) and draft-prefill cost eats the savings. The sweet spot is roughly 1/30 to 1/60 the size of the main model. The 0.5B-for-32B pairing (about 1/64, just past the small end of that range) is the most reliable I have measured.
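
The ratio rule is easy to turn into numbers for whatever main model you are serving. A small sketch, with draft_size_range_b as a hypothetical helper name and sizes expressed in billions of parameters:

```python
from typing import Tuple


def draft_size_range_b(main_params_b: float) -> Tuple[float, float]:
    """Draft sizes (billions of parameters) at 1/60 and 1/30 of the main model."""
    if main_params_b <= 0:
        raise ValueError("main_params_b must be positive")
    return main_params_b / 60, main_params_b / 30


if __name__ == "__main__":
    for main in (7, 32, 70):
        low, high = draft_size_range_b(main)
        print(f"{main}B main -> draft between {low:.2f}B and {high:.2f}B")
```

The 1/60 and 1/30 bounds come straight from the rule of thumb; treat the result as a shortlist of draft sizes to benchmark, not a hard limit.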

Measure Before You Trust It

Run a fixed workload at concurrency 1, 4, 8, 16 with and without speculation. Plot tok/s/user and TTFT. The win only matters where the line moves up — usually concurrency 2–8 for chat, 1–2 for code completion.

python
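import asyncio
import time

from openai import AsyncOpenAI

MODEL = "Qwen/Qwen2.5-32B-Instruct-AWQ"


async def bench(prompt: str, n: int = 20, conc: int = 4) -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")
    sem = asyncio.Semaphore(conc)  # cap in-flight requests at `conc`

    async def one() -> float:
        """Return end-to-end seconds per generated token for one request."""
        async with sem:
            t0 = time.perf_counter()
            r = await client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            elapsed = time.perf_counter() - t0
            # Count generated tokens from the usage block, not characters.
            return elapsed / max(r.usage.completion_tokens, 1)

    t = time.perf_counter()
    secs_per_tok = await asyncio.gather(*[one() for _ in range(n)])
    print(f"conc={conc}  tok/s/user={1 / (sum(secs_per_tok) / len(secs_per_tok)):.1f}")
    print(f"           total wall={time.perf_counter() - t:.1f}s")


if __name__ == "__main__":
    # Example prompt; swap in your own fixed workload.
    asyncio.run(bench("Explain speculative decoding in three sentences."))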

When To Use This

vLLM + speculative decoding pays off when you are serving chat at concurrency 2–10 with a >7B main model. Skip it for tool-calling agents (TTFT cost), for ≤3B main models (overhead dominates), and when you are already GPU-saturated — the draft prefill needs headroom to fit.

Mr. Technology


*vLLM 0.6+ speculative decoding supports the --speculative-model family for n-gram and draft-model modes; EAGLE and Medusa are separate paths. Tested July 2026 on Qwen2.5-32B-Instruct-AWQ + Qwen2.5-0.5B-Instruct on a single A100-80GB. Same-family pairing is the rule; cross-family only works if the tokenizers are byte-equivalent, which they almost never are. Disable for any tool-calling-heavy workload — TTFT regression compounds per turn.*
