arxiv:2601.05149

Multi-Scale Local Speculative Decoding for Image Generation

Published on Jan 8 · Submitted by Amir Habibian on Jan 9

AI-generated summary

Multi-Scale Local Speculative Decoding accelerates autoregressive image generation through multi-resolution drafting and spatially informed verification while maintaining semantic quality and perceptual fidelity.

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and a lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism that corrects draft errors efficiently by focusing on spatial neighborhoods, rather than resampling the entire raster scan after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups (up to 1.7×), outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state of the art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
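For readers new to the technique: verification in speculative decoding typically follows the standard speculative-sampling accept/resample rule, which MuLo-SD adapts to image tokens. Below is a minimal, illustrative sketch of that rule for a single drafted token; the function and variable names are ours, not taken from the paper's code.

```python
import torch

def accept_or_resample(draft_token, p_draft, p_target):
    """Standard speculative-sampling check for one drafted token.

    p_draft / p_target are the drafter's and target model's probability
    vectors over the token vocabulary at this position.
    """
    q = p_draft[draft_token]                      # drafter's confidence in its token
    p = p_target[draft_token]                     # target model's probability for it
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return draft_token, True                  # accepted: keep the drafted token
    # Rejected: resample from the residual distribution max(0, p - q),
    # which preserves the target model's output distribution exactly.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```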

Community

Paper author · Paper submitter

Introducing Multi-Scale Local Speculative Decoding (MuLo-SD), a new framework to supercharge autoregressive (AR) image generation!

By combining multi-resolution drafting with spatially informed verification, we achieve substantial speedups of up to 1.7× while maintaining high perceptual quality and semantic alignment.
The Core Idea: Unlike standard methods that rely on raster-scan rejection, MuLo-SD exploits the spatial structure of images. Candidate tokens are proposed by a low-resolution drafter and a learned up-sampler, then verified in parallel by a high-resolution target model, as sketched below.
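A rough sketch of one draft-and-verify round, under our own assumptions about the interfaces: `drafter`, `upsampler`, and `target` are hypothetical stand-ins for the paper's low-resolution drafter, learned up-sampler, and high-resolution target model, not the authors' actual code.

```python
import torch

@torch.no_grad()
def mulo_sd_round(drafter, upsampler, target, prefix):
    # 1. Draft a coarse token grid cheaply at low resolution.
    coarse_tokens = drafter(prefix)               # e.g. (h*w,) token ids
    # 2. The learned up-sampler lifts the coarse draft to full resolution.
    candidates = upsampler(coarse_tokens)         # e.g. (H*W,) candidate token ids
    # 3. One parallel forward pass of the target model scores every
    #    candidate position at once, instead of one token per step.
    logits = target(torch.cat([prefix, candidates]))
    p_target = logits.softmax(dim=-1)             # per-position distributions
    return candidates, p_target                   # handed to accept/reject logic
```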

Key Innovation: Local Verification 🔍 Crucially, we introduce a local rejection and resampling mechanism. Instead of discarding every token after a single error, we correct errors by focusing only on the spatial neighborhoods of rejected tokens, significantly boosting efficiency.
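To make the idea concrete, here is an illustrative sketch of neighborhood-limited correction. Given the verifier's accept/reject mask over the 2D token grid, only rejected positions, optionally grown to their spatial neighbors, are marked for resampling. The 3x3 max-pooling expansion is our assumption for illustration, not necessarily the paper's exact neighborhood rule.

```python
import torch
import torch.nn.functional as F

def positions_to_resample(rejected, expand_neighborhood=True):
    """rejected: (H, W) bool grid of tokens the target model rejected.

    Returns an (H, W) bool mask of positions to redraw; everything else
    stays accepted, unlike raster-scan schemes that discard all tokens
    after the first rejection.
    """
    mask = rejected.float()[None, None]           # (1, 1, H, W) for pooling
    if expand_neighborhood:
        # Grow each rejection to its 3x3 spatial neighborhood (assumed rule).
        mask = F.max_pool2d(mask, kernel_size=3, stride=1, padding=1)
    return mask[0, 0].bool()                      # True => resample this token
```

Because corrections stay local, accepted tokens elsewhere in the image survive the round, which is where the efficiency gain over raster-scan rollback comes from.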

Results at a Glance:
✅ Outperforms strong baselines like EAGLE-2 and LANTERN.
✅ Consistently delivers greater speedups across 512p and 1024p resolutions.
✅ Integrates seamlessly with unified multimodal LLMs.

🔗 https://qualcomm-ai-research.github.io/mulo-sd-webpage


