Abstract
A novel vision encoder framework is presented that learns modality-agnostic feature representations by aligning multi-modal inputs while preserving semantic distinctions from a frozen teacher model.
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes feature similarity between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the outputs of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous," producing a consistent, powerful embedding for a given scene regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
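The dual objective described above can be sketched as a combined loss. This is a minimal illustration, not the paper's implementation: the function names and the specific loss forms (cosine-distance alignment, mean-squared-error distillation) are assumptions, and the weighting factor `lam` is hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dual_objective(f_rgb, f_depth, f_teacher, lam=1.0):
    """Sketch of the dual training objective (assumed loss forms).

    f_rgb, f_depth: student embeddings of two modalities of the same scene.
    f_teacher: frozen teacher (e.g. DINOv2) embedding of the scene.
    """
    # Alignment term: pull the two modality embeddings of the same scene together.
    align_loss = 1.0 - cosine(f_rgb, f_depth)
    # Distillation term: anchor the student embedding to the frozen teacher output.
    distill_loss = sum((a - b) ** 2 for a, b in zip(f_rgb, f_teacher)) / len(f_rgb)
    return align_loss + lam * distill_loss
```

When the student's RGB and depth embeddings coincide with each other and with the teacher's output, both terms vanish; any modality gap or drift from the teacher increases the loss.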
Community
We adapt DINOv2 into an "omnivorous" encoder that produces consistent embeddings for different input modalities like RGB, depth, and segmentation maps. By aligning paired modalities while anchoring to a frozen DINOv2 teacher, we unlock better cross-modal retrieval and transfer to novel visual modalities, all while preserving DINOv2's pretrained semantics.
The following similar papers were recommended by the Semantic Scholar API:
- Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder (2026)
- Revisiting Multi-Task Visual Representation Learning (2026)
- ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion (2026)
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization (2026)
- OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams (2026)
- DeFM: Learning Foundation Representations from Depth for Robotics (2026)
- AnyThermal: Towards Learning Universal Representations for Thermal Perception (2026)