Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
Abstract
Large vision-language models suffer from visual signal dilution during long-sequence generation; a lightweight persistent visual memory module mitigates this by maintaining visual attention and improving reasoning performance.
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
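The abstract describes PVM as a parallel branch alongside the FFN that retrieves cached visual embeddings through a distance-agnostic pathway and feeds them back into the hidden stream. Below is a minimal, hypothetical PyTorch sketch of one way such a branch could be structured; the dimensions, gating scheme, and attention-based retrieval are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of a PVM-style parallel branch:
# decoder hidden states cross-attend over cached visual embeddings through a
# low-rank bottleneck, and the retrieved signal is gated back alongside the FFN.
import torch
import torch.nn as nn


class PersistentVisualMemoryBranch(nn.Module):
    def __init__(self, hidden_dim: int = 2048, bottleneck_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)          # bottleneck projection
        self.retrieve = nn.MultiheadAttention(bottleneck_dim, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.gate = nn.Parameter(torch.zeros(1))                   # starts closed (no-op at init)

    def forward(self, hidden: torch.Tensor, visual_bank: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, hidden_dim) decoder states at this layer
        # visual_bank: (B, V, bottleneck_dim) visual embeddings cached once at prefill
        q = self.down(hidden)
        retrieved, _ = self.retrieve(q, visual_bank, visual_bank)  # lookup independent of text length
        return torch.tanh(self.gate) * self.up(retrieved)          # gated contribution added to the stream


# Possible wiring inside a decoder layer (pseudo-code):
#   h = h + ffn(norm(h)) + pvm(norm(h), visual_bank)
```

Because the retrieval attends only over the fixed visual bank, its normalization is unaffected by the growing textual context, which is the structural property the paper attributes to the distance-agnostic pathway.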
Community
Autoregressive LVLMs lose visual focus via "Visual Signal Dilution". We solve this with Persistent Visual Memory (PVM), a parallel branch shielding visual retrieval from growing text to ensure length-agnostic grounding and boost complex reasoning.
The core idea of PVM is neat: a parallel visual memory path that retrieves from a fixed visual token bank and gates it back into the main stream, so perception doesn't fade as text length grows. I've seen similar ideas in retrieval-augmented models, but decoupling the visual attention normalization from the autoregressive context is the clever bit that could actually scale to long-horizon reasoning. The arxivlens breakdown helped me parse the exact placement (bottleneck adapter, layers 8, 16, 24) and the 512-d latent bottleneck, which is a nice design sweet spot (https://arxivlens.com/PaperView/Details/persistent-visual-memory-sustaining-perception-for-deep-generation-in-lvlms-4767-46a7743b). One question: would you expect diminishing returns as you scale the visual memory size or move to denser visual tokens, and how sensitive is it to the choice of the fixed visual token bank?
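If the placement details mentioned in this comment are accurate (bottleneck adapter at layers 8, 16, 24 with a 512-d latent), the wiring could look roughly like the sketch below; the layer set, helper name, and dimensions are taken from the comment and are otherwise hypothetical.

```python
# Rough sketch of sparse placement (assumed, not verified against the paper):
# attach the PVM branch only at a few decoder layers and leave the rest untouched.
import torch.nn as nn

PVM_LAYERS = {8, 16, 24}  # assumed insertion points from the comment above


def build_pvm_branches(num_layers: int, hidden_dim: int, bottleneck_dim: int = 512) -> nn.ModuleDict:
    # Reuses the PersistentVisualMemoryBranch sketch shown earlier; all other
    # decoder layers keep their original FFN-only residual path.
    return nn.ModuleDict({
        str(i): PersistentVisualMemoryBranch(hidden_dim, bottleneck_dim)
        for i in range(num_layers)
        if i in PVM_LAYERS
    })
```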
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning (2026)
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings (2026)
- Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding (2026)
- Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification (2026)
- From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception (2026)
- CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models (2026)
- Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors (2026)