arxiv:2604.14629

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Published on Apr 16 · Submitted by HaoyiSun on Apr 17
Abstract

Vision-language models face deployment challenges due to their large size; Switch-KD, a visual-switch knowledge distillation framework, preserves performance in smaller models by unifying multimodal knowledge transfer.

AI-generated summary

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
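
To make the two components concrete, a minimal PyTorch sketch is given below, based only on the abstract above. The names (dbild_loss, VisualSwitch), the top-k reading of "informative probability regions", and the symmetric-KL form of the bidirectional term are illustrative assumptions, not the paper's verified formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dbild_loss(student_logits, teacher_logits, k=64, tau=1.0):
    """Sketch of a Dynamic Bi-directional Logits Difference (DBiLD) loss.

    Assumes the "informative probability region" is the teacher's top-k
    vocabulary entries and that bidirectional supervision is a symmetric
    (forward + reverse) KL over that region; the paper may differ.
    """
    idx = teacher_logits.topk(k, dim=-1).indices                   # informative region
    s = F.log_softmax(student_logits.gather(-1, idx) / tau, dim=-1)
    t = F.log_softmax(teacher_logits.gather(-1, idx) / tau, dim=-1)
    fwd = F.kl_div(s, t, log_target=True, reduction="batchmean")   # KL(teacher || student)
    rev = F.kl_div(t, s, log_target=True, reduction="batchmean")   # KL(student || teacher)
    return 0.5 * (fwd + rev) * tau ** 2

class VisualSwitch(nn.Module):
    """Sketch of visual-switch routing: project the student's visual outputs
    into the teacher's embedding space and decode them with the frozen
    teacher language model, producing cross-modal reference logits in the
    shared text-probability space."""

    def __init__(self, student_dim, teacher_dim, teacher_lm):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # hypothetical dimension adapter
        self.teacher_lm = teacher_lm                     # teacher's language pathway
        for p in self.teacher_lm.parameters():           # keep the teacher frozen
            p.requires_grad_(False)

    def forward(self, student_visual_tokens):
        switched = self.proj(student_visual_tokens)
        # For a Hugging Face decoder this would be
        # self.teacher_lm(inputs_embeds=switched).logits; any module mapping
        # embeddings to vocabulary logits works for this sketch.
        return self.teacher_lm(switched)
```

In a training loop, the reference logits produced by VisualSwitch would plausibly serve as an additional target for dbild_loss alongside the teacher's own text logits; whether those references are detached or gradients flow through the frozen teacher is left open here, as the abstract does not say.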

Community

Paper author · Paper submitter

✨ Switch-KD is the open-source academic release of Li Auto's MindKD technology, published at CVPR Findings 2026. It enables efficient vision-language model distillation through visual-switch supervision and unified multimodal knowledge transfer.


Get this paper in your agent:

hf papers read 2604.14629
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2604.14629 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2604.14629 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2604.14629 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.