Google’s Gemma 4 Gets a 3x Speed Boost — Here’s Why It Matters
Google dropped something genuinely useful in the open-source AI world this week: Multi-Token Prediction (MTP) drafters for Gemma 4 that deliver up to a 3x inference speedup. The announcement was published May 5, 2026 on Google's own blog by Olivier Lacombe (Director, Product Management) and Maarten Grootendorst (Developer Relations Engineer).
For context: the original Gemma 4 models shipped just weeks ago and already hit 60 million downloads. Google’s not resting on that — they’re now releasing MTP drafters that make those same models run significantly faster on everything from consumer GPUs to mobile edge devices.
How It Works
The core problem is familiar to anyone who’s run a local LLM: inference is memory-bandwidth bound. With a 31B parameter model sitting in your VRAM, the GPU spends most of its cycles just shuttling weights around rather than doing useful compute. The result is frustratingly slow token generation, especially on consumer-grade GPUs.
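To put rough numbers on that (illustrative assumptions, not measured figures): every generated token has to stream the full weight set through the GPU's memory bus, so bandwidth, not FLOPs, sets the ceiling. A quick back-of-the-envelope in Python:

```python
# Back-of-the-envelope: why autoregressive decoding is bandwidth-bound.
# All numbers are illustrative assumptions, not benchmarks.
params = 31e9          # a 31B-parameter dense model
bytes_per_param = 2    # bf16 weights; quantization shrinks this, same pattern
bandwidth = 1.0e12     # ~1 TB/s, roughly a high-end consumer GPU

weights_bytes = params * bytes_per_param      # ~62 GB of weights
tokens_per_sec = bandwidth / weights_bytes    # each token re-reads all of them
print(f"upper bound: {tokens_per_sec:.1f} tokens/s")  # ~16 tokens/s
```

Anything that gets more than one token out of each trip through the weights attacks this bottleneck directly, which is exactly what speculative decoding does.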
Speculative decoding solves this by pairing a heavy target model with a lightweight "drafter" model. The drafter predicts several future tokens at once, which is fast because it's small. The target model then verifies all those predictions in a single parallel pass. When the drafter is right (and on predictable continuations it often is), you get multiple output tokens for the cost of roughly one target forward pass.
From the Google blog:
“[If] the target model agrees with the draft, it accepts the entire sequence in a single forward pass — and even generates an additional token of its own in the process.”
In other words, your app gets the full drafted sequence plus one bonus token in roughly the time a single target-model forward pass normally takes.
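Here's a minimal sketch of that accept-and-bonus loop in Python, for the greedy case. `draft_next` and `target_argmax` are hypothetical stand-ins for the real model calls; the control flow, not the models, is the point.

```python
def speculative_step(tokens, draft_next, target_argmax, k=4):
    """One speculative decoding step: draft k tokens, verify them at once.

    draft_next(ctx) -> the small drafter's next-token guess (hypothetical).
    target_argmax(ctx) -> the big model's next-token pick (hypothetical);
    in a real system all k+1 target predictions come from ONE parallel pass.
    """
    # 1. The cheap drafter proposes k tokens autoregressively.
    ctx = list(tokens)
    draft = []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target scores every drafted position (a single pass in practice).
    verified = [target_argmax(tokens + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix where drafter and target agree...
    accepted = []
    for got, want in zip(draft, verified):
        if got != want:
            break
        accepted.append(got)
    # ...then append one bonus token from the target itself, for free.
    accepted.append(verified[len(accepted)])
    return tokens + accepted
```

The effective speedup is the average number of accepted tokens plus one per target pass, so a 3x figure implies the drafter's guesses are accepted most of the time.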
The Numbers
Google tested the MTP drafters across multiple inference runtimes (LiteRT-LM, MLX, Hugging Face Transformers, and vLLM), and the speedups are substantial. The drafters are available for the full Gemma 4 family (a Transformers usage sketch follows the list):
- 26B Mixture of Experts and 31B Dense models for workstations and consumer GPUs
- E2B and E4B models for edge devices, where every millisecond of latency and every milliwatt of battery counts
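On the Transformers side, the library's generic path for pairing a target with a drafter is assisted generation: you pass the drafter as `assistant_model` to `generate()`. A minimal sketch, with placeholder checkpoint names since the blog doesn't list exact repo IDs (and the MTP drafters may plug in through a dedicated integration rather than this generic path):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo IDs -- substitute the actual Gemma 4 and MTP drafter
# checkpoints; the blog post doesn't spell out the exact names.
TARGET = "google/gemma-4-31b"
DRAFTER = "google/gemma-4-mtp-draft"

tok = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(
    TARGET, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Write a haiku about fast inference.", return_tensors="pt").to(target.device)

# assistant_model switches generate() into assisted (speculative) decoding:
# the drafter proposes tokens, the target verifies them in parallel.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```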
The best part: Google emphasizes zero quality degradation. Because the primary Gemma 4 model still verifies every token before it's emitted, the output is identical to what the target model would have produced on its own, just delivered significantly faster.
Why I’m Excited
This is the kind of technical advance that matters more than another benchmark score. The Gemma 4 models with MTP drafters make running capable open models locally genuinely practical for the people who need it most: developers building coding assistants, autonomous agents, and on-device AI.
If you’ve been sitting on a GPU wondering whether local LLMs are practical for production workloads, this changes the calculus. A 3x speed boost doesn’t just make things faster — it makes things that would have been too slow suddenly viable.
The MTP drafters are available through the standard Gemma 4 channels. If you’re running Gemma 4 locally, update your setup.
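For vLLM users, speculative decoding is configured when constructing the engine. A sketch under the assumption of recent vLLM versions, which take a `speculative_config` dict (older releases used separate `speculative_model` / `num_speculative_tokens` arguments, so check your version's docs); checkpoint IDs are placeholders as above:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint IDs; substitute the real Gemma 4 target and
# MTP drafter repos once you know them.
llm = LLM(
    model="google/gemma-4-31b",
    speculative_config={
        "model": "google/gemma-4-mtp-draft",
        "num_speculative_tokens": 4,  # drafted tokens per verification pass
    },
)

out = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```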