arxiv:2510.09008

On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Published on Oct 10
· Submitted by Hoigi Seo on Oct 14
Abstract

A method to reduce object hallucinations in large vision-language models by identifying and masking uncertain visual tokens in the vision encoder.

AI-generated summary

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination: generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis found positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy that uses adversarial perturbations to identify uncertain visual tokens efficiently, and a masking scheme that suppresses these uncertain tokens during the self-attention process in the middle layers of the VE, limiting their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can work synergistically with prior methods.
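As a rough illustration of the perturbation-based proxy described above, the sketch below estimates per-token uncertainty as the deviation of early-layer VE features under a small FGSM-style perturbation of the input image. The model choice (openai/clip-vit-base-patch32), the layer index, the perturbation budget, the loss used to generate the perturbation, and the quantile threshold are all illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of the uncertainty proxy: flag visual tokens whose early-layer
# features shift most under a small adversarial (FGSM-style) perturbation of
# the input image. Model, layer index, epsilon, perturbation loss, and the
# thresholding rule are illustrative assumptions.
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
for p in model.parameters():
    p.requires_grad_(False)

def uncertain_token_mask(pixel_values, early_layer=4, epsilon=1e-3, quantile=0.9):
    """Return a (batch, num_tokens) boolean mask; True marks tokens whose
    early-layer representations deviate strongly under the perturbation."""
    pixel_values = pixel_values.clone().requires_grad_(True)

    clean = model(pixel_values=pixel_values,
                  output_hidden_states=True).hidden_states[early_layer]

    # Placeholder adversarial objective: perturb the input in the direction
    # that increases the early-layer feature norm (one FGSM step).
    clean.norm().backward()
    perturbed_pixels = (pixel_values + epsilon * pixel_values.grad.sign()).detach()

    with torch.no_grad():
        perturbed = model(pixel_values=perturbed_pixels,
                          output_hidden_states=True).hidden_states[early_layer]

    # Per-token representation deviation; larger deviation suggests higher
    # epistemic uncertainty.
    deviation = (perturbed - clean.detach()).norm(dim=-1)        # (B, N)
    threshold = deviation.quantile(quantile, dim=-1, keepdim=True)
    return deviation > threshold

# Toy usage with random pixel values standing in for a preprocessed image.
mask = uncertain_token_mask(torch.randn(1, 3, 224, 224))
print(mask.shape)  # torch.Size([1, 50]): 1 CLS token + 49 patch tokens for ViT-B/32
```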

Community

Paper author · Paper submitter

This paper investigates the problem of object hallucination—when large vision-language models (LVLMs) describe objects that don’t actually appear in an image. The authors reveal that epistemic uncertainty in visual tokens from the vision encoder (VE) is a key factor behind these hallucinations.

To address this, they propose a simple yet effective method that:
• Detects uncertain visual tokens using adversarial perturbations 🧠⚡
• Masks these uncertain tokens during the self-attention process in the vision encoder 🖼️🔍 (see the attention-masking sketch after this list)
• Works efficiently without retraining and can be combined with existing mitigation techniques
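A minimal sketch of the masking step, assuming the uncertain-token mask has already been computed (for example, by the perturbation proxy sketched above): uncertain tokens stay in the sequence but are excluded as attention keys, so other tokens stop aggregating information from them. The single-head attention below is a generic illustration, not the authors' implementation.

```python
# Hedged sketch of masking uncertain visual tokens inside self-attention.
# Uncertain tokens are blocked as attention *keys* by adding -inf to their
# columns in the score matrix, so no token attends to them.
import torch
import torch.nn.functional as F

def masked_self_attention(x, w_q, w_k, w_v, uncertain_mask):
    """x: (B, N, D) visual tokens; uncertain_mask: (B, N) bool, True = uncertain."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)      # (B, N, N)

    # Mask the key columns of uncertain tokens before the softmax.
    scores = scores.masked_fill(uncertain_mask[:, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v

# Toy usage: 1 image, 8 tokens, 16-dim features, tokens 2 and 5 flagged uncertain.
B, N, D = 1, 8, 16
x = torch.randn(B, N, D)
w_q, w_k, w_v = (torch.randn(D, D) for _ in range(3))
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, [2, 5]] = True
out = masked_self_attention(x, w_q, w_k, w_v, mask)
print(out.shape)  # torch.Size([1, 8, 16])
```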

Extensive experiments on benchmarks like CHAIR, POPE, and AMBER show significant reductions in hallucination rates while maintaining high caption quality. This approach provides new insights into how visual uncertainty affects model reliability and offers a lightweight solution for more trustworthy LVLMs 🤖✨.


