Gemma 4's 3x Speed Boost: What MTP Drafters Mean for On-Device AI
What Google Just Shipped
Google has published Multi-Token Prediction (MTP) drafter models, companion models for Gemma 4 that Google says deliver up to 3x faster on-device inference with no quality degradation. The announcement went out on May 5, written by Google’s product and developer relations team. If the numbers hold in production, this is the most meaningful throughput improvement for on-device inference in the past two years.
The goal is explicit: make local inference fast enough that AI features no longer require a round-trip to the cloud.
Why LLM Inference Hits a Memory Wall
Standard LLM generation has a bottleneck that is easy to overlook: it is not compute-bound, it is memory-bandwidth bound. The processor — whether a desktop GPU, a cloud TPU, or a mobile NPU — spends most of its time moving model weights from memory to compute units, not doing arithmetic. For small batch sizes, which is the common case for single-user on-device inference, the hardware sits largely idle waiting on memory transfers.
This is part of why a 4B-parameter model does not feel 4x faster than a 16B model: fewer parameters means fewer bytes to move per token, so the throughput improvement is real, but every autoregressive model still pays one full weight read per generated token, and that memory-bandwidth ceiling keeps per-token latency high.
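A back-of-the-envelope sketch of that ceiling follows. It assumes decoding is purely bandwidth-bound and that every generated token streams all weights from memory exactly once. The device figures (50 GB/s, 4-bit weights) are hypothetical placeholders, not measurements of any real phone or of Gemma 4.

```python
def decode_ceiling_tokens_per_sec(params_billion: float,
                                  bytes_per_param: float,
                                  bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed if each token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    # Hypothetical device: 50 GB/s memory bandwidth, 4-bit weights (0.5 bytes per parameter).
    for size_b in (4, 16):
        ceiling = decode_ceiling_tokens_per_sec(size_b, 0.5, 50.0)
        print(f"{size_b}B parameters: at most {ceiling:.2f} tokens/s")
```

The bound ignores KV-cache reads, activations, and fixed runtime overhead, so measured speeds will sit below it, and the gap between two model sizes will typically be narrower than the raw parameter ratio suggests.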
The other problem is serialization. Standard token generation is sequential by design: to generate token N+1, you must have token N. You cannot parallelize across the token axis at inference time. Every output token requires a full pass through the model’s weights: one more memory transfer, one more latency cycle.
How Speculative Decoding Breaks the Bottleneck
Speculative decoding exploits a fundamental asymmetry: verification is parallelizable, generation is not.
The approach works like this. A small, fast draft model generates the next N tokens in sequence, which is cheap because the model is tiny. Then the large model, Gemma 4, verifies the entire batch of N draft tokens in a single forward pass instead of N separate passes. If the large model agrees with the first four draft tokens but rejects the fifth, it keeps those four and substitutes its own token at the fifth position. The output is five tokens (four accepted drafts plus one correction) for the cost of roughly one large-model pass and a cheap draft run.
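The draft-verify cycle can be sketched in a few lines. This is a simplified illustration, not Google's implementation: it uses greedy exact-match acceptance rather than the probabilistic accept/resample rule used with sampling, and `toy_target` and `toy_draft` are made-up stand-ins for real models.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], n_draft: int) -> List[Token]:
    """Run one draft-verify cycle with greedy (exact-match) acceptance."""
    # 1. Draft n_draft tokens sequentially with the cheap model.
    proposed: List[Token] = []
    for _ in range(n_draft):
        proposed.append(draft(context + proposed))

    # 2. Verify. A real runtime scores every position in ONE batched target pass;
    #    this toy calls the target once per position only to stay readable.
    emitted: List[Token] = []
    for tok in proposed:
        expected = target(context + emitted)
        if tok != expected:
            emitted.append(expected)  # the target's token replaces the rejected draft
            return emitted
        emitted.append(tok)

    # 3. All drafts accepted: the same target pass also yields one extra token.
    emitted.append(target(context + emitted))
    return emitted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             max_new: int, n_draft: int = 4) -> List[Token]:
    """Generate max_new tokens by repeating draft-verify cycles."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, n_draft))
    return out[:len(prompt) + max_new]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except right after a 6."""
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new=12))
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target alone would produce greedily. The drafter only changes how many target passes are needed to get there.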
The expected speedup scales with the drafter’s acceptance rate. If each draft token is accepted 80% of the time, independently of the others, a four-token draft yields about 3.4 tokens per verification cycle on average, and that figure approaches N + 1 as the acceptance rate approaches 100%. When the draft is wrong, the large model’s correction produces the same result that standard sequential generation would have, so no output tokens are wasted; only the rejected draft computation is, and that was cheap.
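The relationship between acceptance rate and tokens per cycle can be made concrete with a small calculation. It assumes each draft token is accepted independently with the same probability, which is a simplification, and that every verification pass emits one extra token from the large model (a correction or a bonus). The function name is illustrative only.

```python
def expected_tokens_per_cycle(accept_rate: float, n_draft: int) -> float:
    """Mean tokens emitted per large-model pass under i.i.d. per-token acceptance.

    The pass always emits one token from the large model, plus every draft token
    up to the first rejection: sum of accept_rate**k for k = 0..n_draft.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for rate in (0.5, 0.8, 0.95):
        print(f"acceptance {rate:.2f}: "
              f"{expected_tokens_per_cycle(rate, n_draft=4):.2f} tokens per cycle")
```

This counts tokens per large-model pass only. The wall-clock speedup is somewhat lower, because the drafter's own passes are cheap but not free.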
MTP Drafters: Google’s Implementation
Google’s drafter models use Multi-Token Prediction as a training objective — the model is trained to predict several future tokens simultaneously in a single forward pass, rather than optimizing only next-token accuracy. This makes them structurally better drafters than a standard language model of equivalent size, because they learn to anticipate token sequences rather than greedily predict the next token in isolation.
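As a rough illustration of what a multi-token training objective looks like, the sketch below averages the cross-entropy over several prediction heads, where head i is scored against the token i + 1 positions ahead. This is the generic form of the idea. The source does not describe Google's exact loss or architecture, so the function name and the toy probabilities are assumptions.

```python
import math
from typing import Sequence


def mtp_loss(head_probs: Sequence[Sequence[float]],
             future_tokens: Sequence[int]) -> float:
    """Average cross-entropy over k heads; head i predicts the token i + 1 steps ahead.

    head_probs[i] is that head's probability distribution over the vocabulary;
    future_tokens[i] is the true token id at that offset.
    """
    if len(head_probs) != len(future_tokens) or not future_tokens:
        raise ValueError("need one target token per prediction head")
    losses = [-math.log(probs[tok]) for probs, tok in zip(head_probs, future_tokens)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy 2-token vocabulary, two heads (hypothetical numbers).
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(mtp_loss(probs, future_tokens=[0, 1]))
```

With a single head this reduces to ordinary next-token cross-entropy. The extra heads are what push the model to commit to a short run of tokens at once.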
The companion models are purpose-built for Gemma 4: small enough to run alongside the main model on the same device, and trained on the same data distribution as Gemma 4, which keeps acceptance rates high. There is no separate inference infrastructure to manage — both models share hardware and the standard inference runtime.
The result, as reported by both Ars Technica and Android Authority: up to 3x faster inference with no loss of output quality.
What This Changes for Mobile AI Features
The latency gap between on-device and cloud inference has been the primary reason most production AI features still route data to a remote server. A 2-3 second first-token wait is tolerable in a chat interface; it is fatal in a real-time text suggestion, a conversational UI transition, or inline summarization inside a content-heavy app.
3x faster on-device inference shifts the viability threshold for:
- Real-time text suggestions — sub-second response at the character level becomes achievable on mid-range hardware
- Conversational multi-turn flows — latency compounds across turns; compressing each step has an outsized effect on perceived responsiveness
- On-device RAG — local retrieval-augmented generation for search or document summarization without sending content offsite
- Privacy-sensitive features — health, finance, and communications data that users increasingly expect to stay on device
We have hit this tradeoff directly in apps like Amali and TeleTabeb: features that benefit most from AI inference are often cut or downgraded to a cloud call because on-device latency made them feel broken. A 3x throughput gain narrows that gap meaningfully, particularly on the high-end Android devices Gemma 4 targets.
What We’re Watching
The published numbers are benchmark figures. What we will be measuring over the next few months is how well the drafter acceptance rates hold against real user inputs, since production prompts are noisier and more varied than benchmark sets. Speculative decoding’s speedup rises with how often the draft model is right, though not linearly; a domain-specific app with predictable prompt patterns will see larger gains than a general-purpose assistant.
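One way to track this in an app is to log, for each verification pass, how many tokens were drafted and how many were accepted, then summarize per prompt category. The sketch below assumes such a log exists. The `(drafted, accepted)` record shape is hypothetical, not something a current runtime is known to expose, and it assumes each pass emits one extra token from the large model.

```python
from typing import Dict, Iterable, Tuple


def summarize_cycles(cycles: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (drafted, accepted) pairs, one pair per verification pass."""
    drafted = accepted = passes = 0
    for d, a in cycles:
        if not 0 <= a <= d:
            raise ValueError(f"invalid cycle record: drafted={d}, accepted={a}")
        drafted += d
        accepted += a
        passes += 1
    if passes == 0:
        raise ValueError("no cycles logged")
    return {
        # Share of drafted tokens that the large model kept.
        "draft_tokens_kept": accepted / drafted if drafted else 0.0,
        # Accepted drafts plus one large-model token per pass.
        "tokens_per_target_pass": (accepted + passes) / passes,
    }


if __name__ == "__main__":
    # Hypothetical log from three verification passes with four-token drafts.
    print(summarize_cycles([(4, 4), (4, 1), (4, 3)]))
```

Comparing `tokens_per_target_pass` between benchmark prompts and real traffic gives a direct read on how much of the headline gain survives production inputs.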
The integration we are waiting on is MediaPipe and LiteRT surfacing the MTP drafters natively, so Android developers can enable speculative decoding without managing the draft-verify loop themselves. That runtime integration — not the model publication — is what will determine whether this improvement reaches production apps at scale.
If you are building an AI-powered mobile feature and are currently routing inference to the cloud because on-device latency is unacceptable, now is a good time to retest your assumptions.
Sources
- Accelerating Gemma 4: faster inference with multi-token prediction drafters — Blog.google, May 5 2026
- Google’s Gemma 4 open AI models use “speculative decoding” to get up to 3x faster — Ars Technica, May 6 2026
- Google’s latest trick gets Gemma 4 running 3x faster right on your phone — Android Authority, May 6 2026