

Gemma 4's 3x Inference Boost: How Speculative Decoding Really Works

· Dracode · ai · developer-tools · android · on-device-ai
[Image: abstract artistic visualization of neural network data flows and algorithms, from the Google DeepMind Visualising AI project]

What Google shipped this week

Google’s Gemma 4 speculative decoding story got concrete this week: the team published dedicated MTP (Multi-Token Prediction) drafter models that promise up to 3x faster inference with no degradation in output quality. These aren’t architecture changes to Gemma 4 itself — they’re separate companion models trained to work alongside the base model in a speculative decoding pipeline. The announcement landed on Google’s AI developer blog on May 5, authored by the Gemma product and developer relations teams.

How Gemma 4 speculative decoding works

Standard transformer inference is serial: each forward pass through the model produces exactly one token. At batch size 1, modern GPUs and NPUs are memory-bandwidth-bound, not compute-bound: every step has to stream the model’s weights from memory just to emit a single token, so the arithmetic units spend much of each step waiting.

Speculative decoding, introduced by Leviathan et al. (2023), changes that arithmetic. A small, fast drafter model generates a sequence of N candidate tokens. The large target model then runs a single forward pass to score all N positions in parallel; because transformers process the positions of a known sequence in parallel by design, verifying N tokens costs only marginally more than generating one. An acceptance test based on rejection sampling then walks the draft left to right. It keeps each token with a probability set by how the target’s and the drafter’s probabilities for that token compare, and at the first rejection it discards the rest of the draft and samples a replacement from a corrected version of the target distribution. That procedure guarantees the final output distribution is identical to what the target model would have produced alone. No quality tradeoff. None.

The result: you generate multiple tokens per target model forward pass instead of one. On memory-bandwidth-limited hardware, this translates directly to lower wall-clock latency.
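The draft-then-verify round described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Gemma 4’s or any SDK’s actual implementation: `draft_dist` and `target_dist` are hypothetical stand-ins for the two models, each returning a next-token probability table, and the target is called once per position here where a real runtime would score all positions in one batched forward pass. The accept-or-resample rule follows Leviathan et al. (2023).

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: weight} table (weights need not sum to 1)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]


def speculative_step(draft_dist, target_dist, prefix, n_draft, rng):
    """Run one draft-then-verify round and return the newly generated tokens.

    draft_dist / target_dist are hypothetical callables: given a token list,
    they return {token: probability} for the next position.
    """
    # 1. The drafter proposes n_draft tokens, one at a time (cheap but serial).
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(n_draft):
        q = draft_dist(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target scores every drafted position plus one extra.
    #    (A real runtime does this in a single parallel forward pass.)
    p_dists = [target_dist(list(prefix) + drafted[:i]) for i in range(n_draft + 1)]

    # 3. Accept each draft token with probability min(1, p/q), left to right.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from max(0, p - q) and drop the rest.
        residual = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out

    # Every draft token was accepted: the target pass yields one bonus token.
    out.append(sample(p_dists[n_draft], rng))
    return out


# Toy, context-independent "models" (made-up numbers for illustration only).
TARGET = {"a": 0.6, "b": 0.3, "c": 0.1}
DRAFT = {"a": 0.3, "b": 0.5, "c": 0.2}

rng = random.Random(0)
new_tokens = speculative_step(lambda ctx: DRAFT, lambda ctx: TARGET, [], 4, rng)
print(new_tokens)  # between 1 and 5 tokens from a single target pass
```

Each round emits at least one token and at most `n_draft + 1`, and over many rounds the tokens follow the target’s distribution even though the drafter here is deliberately mismatched.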

What’s different about Google’s MTP drafters

The “multi-token prediction” in MTP refers to the training objective, not just the inference strategy. Rather than using any convenient small model as the drafter, Google trained dedicated companion models jointly optimized to predict token sequences that Gemma 4 is likely to accept. A drafter that consistently matches what the verifier would have produced achieves a high acceptance rate; a poorly aligned drafter gives minimal speedup because the verifier rejects the draft almost immediately on nearly every step, so each expensive target pass still yields little more than one token.
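One way to see why alignment matters: under the acceptance rule from Leviathan et al. (2023), the chance that a single drafted token is accepted equals the overlap between the two models’ next-token distributions, that is, the sum over tokens of min(p, q). The sketch below computes that overlap for two drafters. The probability tables are made up for illustration and are not measurements from Gemma 4 or its drafters.

```python
def acceptance_probability(p, q):
    """Chance that a token drafted from q is accepted by a verifier with distribution p."""
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in set(p) | set(q))


# Hypothetical next-token distributions for one position.
target = {"the": 0.70, "a": 0.20, "an": 0.10}
aligned_drafter = {"the": 0.65, "a": 0.25, "an": 0.10}
misaligned_drafter = {"the": 0.20, "a": 0.30, "an": 0.50}

print(round(acceptance_probability(target, aligned_drafter), 2))     # 0.95
print(round(acceptance_probability(target, misaligned_drafter), 2))  # 0.5
```

The closer the drafter’s distribution tracks the target’s, the closer the overlap gets to 1, which is the quantity a purpose-trained drafter is trying to maximize.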

The practical implication: these drafter models are not general-purpose. They’re checkpoints released alongside Gemma 4, useful only in a speculative decoding pipeline with the matching Gemma 4 base model. The tighter the drafter–verifier coupling, the higher the acceptance rate, and the closer you get to the headline speedup.

When “up to 3x” actually holds

Three times is the upper bound, not the average. The actual speedup is a function of the acceptance rate — how often the drafter’s predicted token matches what the verifier would have chosen — and the draft length. High acceptance on predictable tasks like code generation, structured data extraction, and document summarization gets you close to the ceiling. Open-ended creative generation, where many valid next tokens exist, lowers the acceptance rate and shrinks the gain. Ars Technica notes the quality preservation claim is real — the acceptance criterion is a mathematical guarantee, not a heuristic — even when the speedup falls short of 3x.
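The dependence on acceptance rate and draft length can be made concrete with the expected-value analysis from Leviathan et al. (2023). This is a minimal sketch, assuming each draft token is accepted independently with probability `alpha`, the drafter proposes `gamma` tokens per round, and one drafter pass costs `cost_ratio` times one target pass. All the numbers plugged in below are illustrative placeholders, not Gemma 4 measurements.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens produced per target forward pass."""
    if alpha >= 1.0:
        return gamma + 1.0  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-clock speedup over plain one-token-per-pass decoding."""
    # One round costs gamma drafter passes plus one target pass.
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


# Placeholder settings: 4 draft tokens, drafter pass at 5% of a target pass.
for alpha in (0.8, 0.6):
    tokens = expected_tokens_per_pass(alpha, 4)
    speedup = expected_speedup(alpha, 4, 0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/pass, {speedup:.2f}x")
# alpha=0.8: 3.36 tokens/pass, 2.80x
# alpha=0.6: 2.31 tokens/pass, 1.92x
```

Under these assumptions a drop in acceptance rate from 0.8 to 0.6 cuts the gain from roughly 2.8x to under 2x, and a drafter that is never accepted (`alpha=0`) ends up slower than not drafting at all, because its passes are pure overhead.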

Two real-world constraints matter for on-device deployments. First, you need both models in memory simultaneously. On a device with 6–8 GB of RAM available to the inference process, fitting a drafter alongside the full Gemma 4 model is tight. Second, the drafter runs serially before the verifier’s parallel pass, adding overhead on hardware where even small model inference is slow. The Android Authority coverage emphasizes the on-device angle, which suggests Google is targeting modern flagship NPUs — not the mid-range baseline.
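The memory constraint is easy to estimate on the back of an envelope: weight storage is roughly parameter count times bits per weight. The sketch below is a rough budget check under loud assumptions. The parameter counts, bit widths, and KV-cache allowance are hypothetical placeholders, not published Gemma 4 or drafter figures, and runtime overhead is ignored.

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def memory_check(target_params, drafter_params, bits_per_weight, kv_cache_gb, budget_gb):
    """Return (total_gb, fits) for holding both models in memory at once."""
    total = (
        weights_gb(target_params, bits_per_weight)
        + weights_gb(drafter_params, bits_per_weight)
        + kv_cache_gb
    )
    return total, total <= budget_gb


# Placeholder sizes: a 4B-parameter target and a 0.5B-parameter drafter,
# 0.5 GB reserved for KV cache, 6 GB available to the inference process.
for bits in (4, 16):
    total, fits = memory_check(4e9, 0.5e9, bits, 0.5, 6.0)
    print(f"{bits}-bit weights: {total:.2f} GB, fits={fits}")
# 4-bit weights: 2.75 GB, fits=True
# 16-bit weights: 9.50 GB, fits=False
```

Swapping in real parameter counts and the precision a given runtime actually uses shows quickly whether a drafter pushes a device over its budget.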

What this changes for mobile AI features

On-device inference latency is the primary UX constraint for real-time AI features in mobile apps. A first-token latency above ~300ms reads as lag; a full response that takes more than 2 seconds interrupts the user’s thought. A Gemma 4 variant sitting at 20–30 tokens/sec on a mid-range NPU becomes a different product at 60+ tokens/sec. That’s the gap between “background task” and “interactive feature.”
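The thresholds above translate into simple arithmetic. As a rough sketch, assume a 100-token response and a 0.3-second first-token latency; both are illustrative assumptions, and the throughput figures are the ones quoted in the paragraph above rather than benchmarks.

```python
def response_seconds(n_tokens, tokens_per_sec, first_token_sec=0.3):
    """Time to stream a full response: first token, then the rest at steady rate."""
    return first_token_sec + (n_tokens - 1) / tokens_per_sec


for rate in (25, 60):
    print(f"{rate} tokens/sec: {response_seconds(100, rate):.2f} s")
# 25 tokens/sec: 4.26 s
# 60 tokens/sec: 1.95 s
```

Under these assumptions the same response moves from more than double the 2-second threshold to just under it.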

For teams building AI-native mobile apps, the practical question right now is toolchain support. MediaPipe Tasks and the Google AI SDK for Android are the most likely integration paths. Whether the drafter–verifier pair runs out-of-the-box in those SDKs, or requires custom orchestration, will determine how quickly this reaches production apps rather than staying a benchmark result. The technique is sound; the DX is what ships.

The broader signal is the competitive dynamic. On-device AI inference speed is a differentiator that Google, Apple, and Qualcomm are all fighting over. Speculative decoding has been standard at the server level for over a year. Google bringing it to Gemma 4’s on-device story is an escalation of that race, and it puts pressure on Apple to ship equivalent gains for on-device models before WWDC.

What we’re watching

Acceptance rates on real app workloads, not synthetic benchmarks, are the first thing to track. We want to see developers publish latency measurements for typical mobile use cases: conversational completion, document Q&A, on-device classification. The headline number will look different in those conditions.

Whether Apple follows with equivalent drafter support for their on-device model lineup is the other question. If speculative decoding becomes standard on Android, iOS developers will reasonably expect the same from Core ML or whatever Apple announces for on-device inference this year.

We’re also watching memory overhead in practice. The 1B Gemma 4 variant is the relevant baseline for broad device support. Whether a drafter model for the 1B fits the memory envelope of phones with 8 GB RAM determines whether this is a flagship-only feature or something that ships at scale — and that distinction matters for the apps we work on with our clients.

Sources

  1. Accelerating Gemma 4: faster inference with multi-token prediction drafters — Google AI Developer Blog, May 5, 2026
  2. Google’s latest trick gets Gemma 4 running 3x faster right on your phone — Android Authority, May 6, 2026
  3. Google’s Gemma 4 open AI models use “speculative decoding” to get up to 3x faster — Ars Technica, May 6, 2026
  4. Fast Inference from Transformers via Speculative Decoding — Leviathan et al., 2023