Let's talk about something that flew under the radar but actually matters: llama.cpp merged Multi-Token Prediction support. If you're running local LLMs and you haven't been paying attention to this, you should be.
Multi-Token Prediction (MTP) is exactly what it sounds like: instead of predicting one token at a time, the model predicts several tokens per step. This isn't some experimental hack. It's a legitimate architecture improvement that's been floating around in research papers for a couple of years, and it's finally made its way into the most widely used open-source LLM inference engine on the planet.
The key insight: instead of waiting for each token to finish before starting the next, MTP uses the model's built-in prediction heads to draft candidate tokens ahead of time. The first token is produced normally by the main model, and if the drafted second token is accepted (which happens roughly 85-90% of the time), you just saved yourself a whole decode cycle. For certain workloads, that's up to a 1.8x throughput improvement with no retraining, as long as the model already ships with MTP heads.
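To see where a number like 1.8x can come from, here is a back-of-the-envelope sketch in Python. It assumes every drafted token is accepted independently with the same probability, and that drafting and verification cost nothing extra, so treat the result as an upper bound rather than a benchmark. The function name and the `num_draft` parameter are made up for illustration; they are not llama.cpp options.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int = 1) -> float:
    """Expected tokens emitted per main-model pass.

    Assumption (illustrative only): each drafted token is accepted
    independently with probability `accept_rate`, and the first rejection
    discards every draft after it. The main model always contributes one
    token, the i-th draft survives with probability accept_rate ** i.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    # 1 + p + p^2 + ... + p^num_draft
    return sum(accept_rate ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    for rate in (0.85, 0.90):
        print(rate, expected_tokens_per_step(rate))
```

With one drafted token and an 85-90% acceptance rate, this gives 1.85 to 1.9 tokens per pass, which is in the same ballpark as the roughly 1.8x figure once you allow for real-world overhead.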
DeepSeek-V3 documented the approach well: their technical report cites 1.8x tokens per second when the MTP head is used for speculative decoding. Recent llama.cpp tests with Qwen3.6 27B running on an RTX 3090 showed around a 71% speedup on real inference tasks. Those aren't synthetic microbenchmarks; that's what you'd actually see working with these models day-to-day.
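For a sense of what those multipliers mean in wall-clock terms, here is a small Python sketch that converts a throughput speedup into time saved. The helper names are hypothetical, and the 20 tokens per second baseline in the usage lines is a placeholder rather than a measurement.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of generation time removed by a given throughput speedup.

    A speedup of 1.71 means 1.71x the tokens per second, so the same
    output takes 1 / 1.71 of the original time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def generation_seconds(num_tokens: int, baseline_tps: float, speedup: float = 1.0) -> float:
    """Seconds needed to generate `num_tokens` at `baseline_tps * speedup`."""
    if baseline_tps <= 0 or speedup <= 0:
        raise ValueError("baseline_tps and speedup must be positive")
    return num_tokens / (baseline_tps * speedup)


if __name__ == "__main__":
    # Hypothetical baseline of 20 tokens/s; substitute your own measurement.
    print(generation_seconds(1000, 20.0))
    print(generation_seconds(1000, 20.0, speedup=1.71))
    print(time_saved_fraction(1.71))
```

A 71% speedup works out to about 41.5% less waiting for the same output, and 1.8x works out to about 44%.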
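According to the DeepSeek-V3 report, the acceptance rate for second-token predictions sits in that 85-90% range across different generation topics, so the speedup holds up across workloads instead of depending on a lucky few prompts. And because a drafted token is only emitted after the main model verifies it, you don't lose quality; you gain speed.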
MTP in llama.cpp works by leveraging the auxiliary prediction heads that many modern models already have baked in. Those heads play the role that a separate draft model plays in classic speculative decoding. When you enable MTP, the engine runs the main model once, then uses the prediction heads to draft extra candidates. Those candidates get verified in a single forward pass of the main model, and only the accepted ones get emitted.
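The draft-then-verify loop described above can be sketched with toy stand-ins. This is not llama.cpp code: `main_next` and `draft_next` are hypothetical Python callables playing the main model and the MTP head, decoding is greedy, and only one token is drafted per step. Each loop iteration represents one batched main-model pass.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Token = int
NextFn = Callable[[Sequence[Token]], Token]


def mtp_greedy_decode(
    prompt: Sequence[Token],
    main_next: NextFn,   # toy stand-in for the main model (hypothetical)
    draft_next: NextFn,  # toy stand-in for the MTP draft head (hypothetical)
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Return (new tokens, number of main-model passes)."""
    out = list(prompt)
    limit = len(out) + max_new_tokens
    passes = 0
    draft: Optional[Token] = None  # pending guess for the next position
    while len(out) < limit:
        passes += 1  # one pass scores the context plus the pending draft
        verified = main_next(out)
        out.append(verified)
        if draft is not None and draft == verified and len(out) < limit:
            # Draft accepted: the same pass already scored the position
            # after it, so that token comes for free.
            out.append(main_next(out))
        # The draft head guesses the token for the next position.
        draft = draft_next(out)
    return out[len(prompt):], passes


if __name__ == "__main__":
    def count_up(ctx: Sequence[Token]) -> Token:
        return (ctx[-1] + 1) % 10

    print(mtp_greedy_decode([0], count_up, count_up, 9))        # perfect drafts
    print(mtp_greedy_decode([0], count_up, lambda c: -1, 9))    # useless drafts
```

Every emitted token comes from `main_next`, so the output matches plain greedy decoding no matter how good the drafts are. Draft quality only changes how many passes it takes to get there.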
The beauty is that it's transparent to the user — you don't change your prompt, you don't change your model (assuming the model supports MTP), you just flip a flag and watch your throughput improve. Qwen3.6 and similar architectures that were designed with MTP in mind will work out of the box.
For the folks running llama.cpp in production, this is a meaningful upgrade. If you're doing anything that involves long-form generation, code synthesis, or any scenario where latency matters, this directly translates to a better user experience with zero downside.
Here's the thing that excites me most: MTP makes certain models viable that weren't before. MoE models with 30+ billion active parameters were borderline for real-time use on consumer hardware. With MTP pushing throughput up by 70%+, suddenly you're looking at a different calculus. That 3090 in your workstation can now handle workloads that previously required an A100.
The llama.cpp team has been quietly building the most capable local inference stack in existence. MTP is the latest evidence of that. No fanfare, no big release announcement — just a solid technical improvement merged into the main branch.
If you're running local models and haven't updated your llama.cpp build in the last few weeks, do yourself a favor and pull the latest. The MTP support is there, it's stable, and the performance gains are real.
— Mr. Technology