Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Abstract
A systematic analysis of language prior in large vision-language models reveals a Visual Integration Point and introduces a Total Visual Integration estimator to quantify visual influence on response generation.
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), i.e., memorized textual patterns from pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges and that TVI reliably predicts the strength of the language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
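To make the idea concrete, here is a minimal sketch of a VIP/TVI-style computation. The function names, the distance metric, and the VIP-detection heuristic are illustrative assumptions, not the paper's exact formulation; it assumes you have already extracted per-layer hidden states for a query under two conditions (with the image and with the image ablated).

```python
# Hypothetical sketch: locate a Visual Integration Point (VIP) from the
# chain-of-embedding and aggregate distances beyond it into a TVI-like score.
# All names and thresholds here are illustrative assumptions.

import numpy as np

def layerwise_distance(h_visual, h_text_only):
    """Per-layer L2 distance between the two chains of embeddings.

    h_visual, h_text_only: arrays of shape (num_layers, hidden_dim),
    e.g., the hidden state of the last input token at every layer.
    """
    return np.linalg.norm(h_visual - h_text_only, axis=-1)

def find_vip(distances, rel_threshold=0.1):
    """Heuristic VIP: the first layer whose distance exceeds a fraction
    of the maximum layer-wise distance (an assumed detection rule)."""
    above = np.nonzero(distances > rel_threshold * distances.max())[0]
    return int(above[0]) if above.size > 0 else len(distances) - 1

def total_visual_integration(distances, vip):
    """Aggregate representation distance beyond the VIP (one plausible
    reading of the TVI estimator)."""
    return float(distances[vip:].sum())

# Toy example with random activations standing in for real LVLM hidden states.
rng = np.random.default_rng(0)
num_layers, hidden_dim = 32, 4096
h_text_only = rng.normal(size=(num_layers, hidden_dim))
h_visual = h_text_only.copy()
# Simulate visual information reshaping representations from layer 12 onward.
h_visual[12:] += rng.normal(scale=0.5, size=(num_layers - 12, hidden_dim))

d = layerwise_distance(h_visual, h_text_only)
vip = find_vip(d)
tvi = total_visual_integration(d, vip)
print(f"VIP layer: {vip}, TVI: {tvi:.2f}")
```

Under this reading, a small TVI indicates that the response is driven mostly by the language prior, while a large TVI indicates substantial visual influence on generation.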
Community
We provide a principled framework to understand and quantify the "language prior" of large vision-language models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- Lost in Embeddings: Information Loss in Vision-Language Models (2025)
- From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs (2025)
- Latent Visual Reasoning (2025)
- Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models (2025)
- Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens (2025)