This repository contains the pre-trained weights and metadata of VL-SAE, which helps users understand the vision-language alignment of VLMs via concepts.
The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer corresponds to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we align them in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination.
The source code is available here.
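For intuition, the sketch below shows what a VL-SAE-style module could look like in PyTorch: a distance-based encoder whose neurons activate according to the cosine similarity between a representation and learned concept directions, plus two modality-specific decoders. This is a minimal sketch under our own assumptions; the class name, dimensions, and top-k sparsification are illustrative and are not taken from the released implementation.

```python
# Minimal sketch (NOT the authors' implementation) of a VL-SAE-style module:
# a distance-based encoder shared by both modalities and two modality-specific decoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLSAESketch(nn.Module):
    def __init__(self, rep_dim=768, num_concepts=4096, top_k=32):
        super().__init__()
        # One learnable direction per concept neuron (assumed parameterization).
        self.concepts = nn.Parameter(torch.randn(num_concepts, rep_dim))
        # Separate decoders so each modality is reconstructed in its own space.
        self.decode_vision = nn.Linear(num_concepts, rep_dim, bias=False)
        self.decode_text = nn.Linear(num_concepts, rep_dim, bias=False)
        self.top_k = top_k

    def encode(self, reps):
        # Distance-based encoding: a neuron's activation depends on the cosine
        # similarity between the input representation and its concept direction,
        # so semantically similar inputs yield similar activations.
        sims = F.normalize(reps, dim=-1) @ F.normalize(self.concepts, dim=-1).T
        # Keep only the top-k most similar concepts to enforce sparsity (assumed mechanism).
        vals, idx = sims.topk(self.top_k, dim=-1)
        acts = torch.zeros_like(sims).scatter_(-1, idx, F.relu(vals))
        return acts

    def forward(self, reps, modality="vision"):
        acts = self.encode(reps)
        decoder = self.decode_vision if modality == "vision" else self.decode_text
        return decoder(acts), acts
```

In this sketch, comparing the hidden activations produced for an image representation with those produced for a text representation gives a concept-level view of how well the two are aligned, which mirrors the interpretation use case described above.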
```bash
# Download using huggingface_cli
pip install huggingface_hub
huggingface-cli download shufanshen/VL-SAE
```

```bash
# Download using git
git lfs install
git clone git@hf.co:shufanshen/VL-SAE
```
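After downloading, the checkpoints can be inspected with `huggingface_hub` and `torch`. The snippet below is a minimal sketch; the file name `vl_sae_clip.pt` is a placeholder, so check the repository's file listing for the actual checkpoint names and formats.

```python
# Minimal loading sketch; "vl_sae_clip.pt" is a placeholder file name,
# not necessarily a file that exists in this repository.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="shufanshen/VL-SAE", filename="vl_sae_clip.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict.keys())[:10])  # inspect the stored parameter names
```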
If you find VL-SAE useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{shen2025vlsae,
      title={VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set},
      author={Shufan Shen and Junshu Sun and Qingming Huang and Shuhui Wang},
      year={2025},
      eprint={2510.21323},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.21323},
}
```
