arxiv:2601.05149

Multi-Scale Local Speculative Decoding for Image Generation

Published on Jan 8 · Submitted by Amir Habibian on Jan 9

AI-generated summary

Multi-Scale Local Speculative Decoding accelerates autoregressive image generation through multi-resolution drafting and spatially informed verification while maintaining semantic quality and perceptual fidelity.

Abstract

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency. Speculative decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and a lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism that corrects draft errors efficiently by focusing on spatial neighborhoods, rather than resampling the entire raster scan after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups (up to 1.7×), outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state of the art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
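For readers new to the technique: verification in speculative decoding typically follows the standard speculative-sampling accept/resample rule, which MuLo-SD adapts to image tokens. Below is a minimal, illustrative sketch of that rule for a single drafted token; the function and variable names are ours, not taken from the paper's code.

```python
import torch

def accept_or_resample(draft_token, p_draft, p_target):
    """Standard speculative-sampling check for one drafted token.

    p_draft / p_target are the drafter's and target model's probability
    vectors over the token vocabulary at this position.
    """
    q = p_draft[draft_token]                      # drafter's confidence in its token
    p = p_target[draft_token]                     # target model's probability for it
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return draft_token, True                  # accepted: keep the drafted token
    # Rejected: resample from the residual distribution max(0, p - q),
    # which preserves the target model's output distribution exactly.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```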

Community

Paper author · Paper submitter

Introducing Multi-Scale Local Speculative Decoding (MuLo-SD), a new framework to supercharge autoregressive (AR) image generation!

By combining multi-resolution drafting with spatially informed verification, we achieve substantial speedups of up to 1.7× while maintaining high perceptual quality and semantic alignment.
The Core Idea: Unlike standard methods that rely on raster-scan rejection, MuLo-SD exploits the spatial structure of images. Candidate tokens are proposed by a low-resolution drafter and a learned up-sampler, then verified in parallel by a high-resolution target model, as sketched below.
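A rough sketch of one draft-and-verify round, under our own assumptions about the interfaces: `drafter`, `upsampler`, and `target` are hypothetical stand-ins for the paper's low-resolution drafter, learned up-sampler, and high-resolution target model, not the authors' actual code.

```python
import torch

@torch.no_grad()
def mulo_sd_round(drafter, upsampler, target, prefix):
    # 1. Draft a coarse token grid cheaply at low resolution.
    coarse_tokens = drafter(prefix)               # e.g. (h*w,) token ids
    # 2. The learned up-sampler lifts the coarse draft to full resolution.
    candidates = upsampler(coarse_tokens)         # e.g. (H*W,) candidate token ids
    # 3. One parallel forward pass of the target model scores every
    #    candidate position at once, instead of one token per step.
    logits = target(torch.cat([prefix, candidates]))
    p_target = logits.softmax(dim=-1)             # per-position distributions
    return candidates, p_target                   # handed to accept/reject logic
```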

Key Innovation: Local Verification 🔍 Crucially, we introduce a local rejection and resampling mechanism. Instead of discarding every token after a single error, we correct errors by focusing only on the spatial neighborhoods of rejected tokens, significantly boosting efficiency.
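To make the idea concrete, here is an illustrative sketch of neighborhood-limited correction. Given the verifier's accept/reject mask over the 2D token grid, only rejected positions, optionally grown to their spatial neighbors, are marked for resampling. The 3x3 max-pooling expansion is our assumption for illustration, not necessarily the paper's exact neighborhood rule.

```python
import torch
import torch.nn.functional as F

def positions_to_resample(rejected, expand_neighborhood=True):
    """rejected: (H, W) bool grid of tokens the target model rejected.

    Returns an (H, W) bool mask of positions to redraw; everything else
    stays accepted, unlike raster-scan schemes that discard all tokens
    after the first rejection.
    """
    mask = rejected.float()[None, None]           # (1, 1, H, W) for pooling
    if expand_neighborhood:
        # Grow each rejection to its 3x3 spatial neighborhood (assumed rule).
        mask = F.max_pool2d(mask, kernel_size=3, stride=1, padding=1)
    return mask[0, 0].bool()                      # True => resample this token
```

Because corrections stay local, accepted tokens elsewhere in the image survive the round, which is where the efficiency gain over raster-scan rollback comes from.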

Results at a Glance:
✅ Outperforms strong baselines like EAGLE-2 and LANTERN.
✅ Consistently delivers greater speedups across 512p and 1024p resolutions.
✅ Integrates seamlessly with unified multimodal LLMs.

🔗 https://qualcomm-ai-research.github.io/mulo-sd-webpage


