Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Paper • 2411.14982 • Published Nov 22, 2024 • 19
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration Paper • 2411.17686 • Published Nov 26, 2024 • 20
On the Limitations of Vision-Language Models in Understanding Image Transforms Paper • 2503.09837 • Published Mar 12, 2025 • 10
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16, 2025 • 35
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation Paper • 2503.16660 • Published Mar 20, 2025 • 72
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration Paper • 2503.12821 • Published Mar 17, 2025 • 9
Scaling Laws for Native Multimodal Models Paper • 2504.07951 • Published Apr 10, 2025 • 29
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models Paper • 2505.14071 • Published May 20, 2025 • 1
To Trust Or Not To Trust Your Vision-Language Model's Prediction Paper • 2505.23745 • Published May 29, 2025 • 4
Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning Paper • 2506.04755 • Published Jun 5, 2025 • 37
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Paper • 2506.09040 • Published Jun 10, 2025 • 34
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks Paper • 2507.01955 • Published Jul 2, 2025 • 35
Robust Multimodal Large Language Models Against Modality Conflict Paper • 2507.07151 • Published Jul 9, 2025 • 5
Automating Steering for Safe Multimodal Large Language Models Paper • 2507.13255 • Published Jul 17, 2025 • 3
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios Paper • 2507.20198 • Published Jul 27, 2025 • 26
Adapting Vision-Language Models Without Labels: A Comprehensive Survey Paper • 2508.05547 • Published Aug 7, 2025 • 11
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success Paper • 2508.04280 • Published Aug 6, 2025 • 35
IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding Paper • 2508.09456 • Published Aug 13, 2025 • 8
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs Paper • 2508.18264 • Published Aug 25, 2025 • 25
Visual Representation Alignment for Multimodal Large Language Models Paper • 2509.07979 • Published Sep 9, 2025 • 82
Lost in Embeddings: Information Loss in Vision-Language Models Paper • 2509.11986 • Published Sep 15, 2025 • 27
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs Paper • 2509.16633 • Published Sep 20, 2025 • 1
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation Paper • 2509.22496 • Published Sep 2025 • 3
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models Paper • 2510.09008 • Published Oct 2025 • 14