X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
Abstract
Reinforcement learning enhances discrete autoregressive modeling for images and language, enabling a unified framework that delivers high-quality image generation with strong instruction-following capabilities.
Numerous efforts have been made to extend the ``next token prediction'' paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings likely stem from cumulative errors during autoregressive inference or from information loss incurred during discretization. Likely owing to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially enhance the generation quality of a discrete autoregressive modeling method, thereby enabling the seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
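To make the three-stage design concrete, here is a minimal sketch of how such a pipeline could fit together at inference time. All class and method names (`SemanticTokenizer`, `encode_text`, `sample`, `decode`, etc.) are hypothetical illustrations of the components named in the abstract, not the paper's actual API.

```python
# Hypothetical sketch of the pipeline described in the abstract:
# a semantic image tokenizer, a unified autoregressive model over
# text and image tokens, and an offline diffusion decoder.
# All names below are illustrative assumptions, not the authors' code.

class XOmniPipeline:
    def __init__(self, tokenizer, ar_model, diffusion_decoder):
        self.tokenizer = tokenizer        # maps text/images <-> discrete semantic tokens
        self.ar_model = ar_model          # 7B LM doing next-token prediction over both modalities
        self.decoder = diffusion_decoder  # renders pixels from semantic image tokens

    def generate_image(self, prompt: str, num_image_tokens: int = 1024):
        # 1. Encode the text prompt into the shared discrete vocabulary.
        context = self.tokenizer.encode_text(prompt)

        # 2. Autoregressively sample discrete image tokens with the same
        #    "next token prediction" used for language generation.
        image_tokens = self.ar_model.sample(context, max_new_tokens=num_image_tokens)

        # 3. Decode the semantic tokens to pixels with the offline
        #    diffusion decoder, which can absorb quantization artifacts.
        return self.decoder.decode(image_tokens)
```

Under this reading, reinforcement learning would fine-tune only the autoregressive sampling stage, leaving the tokenizer and diffusion decoder fixed.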
Community
X-Omni is a unified discrete autoregressive model for both image and language modalities, achieving state-of-the-art image generation with a 7B language model while excelling at instruction following and long-text rendering.
Really cool, thanks for sharing.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better (2025)
- Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation (2025)
- MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models (2025)
- Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation (2025)
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation (2025)
- Show-o2: Improved Native Unified Multimodal Models (2025)
- LVLM-Composer's Explicit Planning for Image Generation (2025)