Abstract
Latent Visual Reasoning (LVR) enhances visual question answering by enabling autoregressive reasoning in the visual embedding space, improving fine-grained visual understanding.
Multimodal Large Language Models (MLLMs) have achieved notable gains on various tasks by incorporating Chain-of-Thought (CoT) reasoning in the language space. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectory. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as a static precondition. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct the key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. The code base and model weights will be released.
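The abstract describes a two-part training signal: a standard language-modeling loss on interleaved text positions, plus a loss that encourages the model's latent states to reconstruct key visual tokens. The PyTorch sketch below illustrates one way such an interleaved objective could be combined; it is not the authors' implementation, and the names (`lvr_head`, the cosine reconstruction term, the weight `alpha`, and how key tokens are selected) are assumptions made for illustration.

```python
# Minimal sketch of an interleaved LVR-style objective (illustrative only).
import torch
import torch.nn.functional as F

def lvr_training_loss(lm_hidden, key_visual_tokens, text_logits, text_labels,
                      lvr_head, alpha=1.0):
    """Combine next-token prediction with a latent visual reconstruction term.

    lm_hidden:          (B, T_lvr, D) hidden states emitted at latent-reasoning positions
    key_visual_tokens:  (B, T_lvr, D) target visual tokens deemed critical for the query
    text_logits:        (B, T_txt, V) logits at ordinary text positions
    text_labels:        (B, T_txt)    gold next tokens at those positions
    lvr_head:           module projecting LM hidden states into the visual token space
    alpha:              weight balancing the two terms (assumed hyperparameter)
    """
    # Latent visual reasoning term: latent states should reconstruct key visual tokens.
    pred_visual = lvr_head(lm_hidden)
    lvr_loss = 1.0 - F.cosine_similarity(pred_visual, key_visual_tokens, dim=-1).mean()

    # Standard language-modeling term on the interleaved text positions.
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten())

    return lm_loss + alpha * lvr_loss
```

In this sketch the reconstruction target is the set of visual tokens produced by the frozen visual encoder; whether the paper uses a cosine, MSE, or contrastive objective is not stated in the abstract, so the choice above is purely a placeholder.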
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space (2025)
- Self-Rewarding Vision-Language Model via Reasoning Decomposition (2025)
- DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning (2025)
- Growing Visual Generative Capacity for Pre-Trained MLLMs (2025)
- LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning (2025)