Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors
LangDC Overview
Current large video-language models face efficiency issues because they must process massive numbers of visual tokens. Existing fixed-ratio token compression ignores the varying semantic density across video clips, which leads to inadequate representation of information-rich clips (too few tokens) and wasted computation on static or content-poor ones. To address this, we propose LangDC, a Language-aware Dynamic token Compressor. LangDC leverages a lightweight language model to describe video clips, converting them into soft caption tokens that serve as the visual representation. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover the key visual cues necessary for downstream task reasoning and 2) dynamically adjust the compression ratio based on scene richness, as reflected by description length.
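
The core mechanism can be illustrated with a minimal, hedged sketch (the module names, dimensions, and greedy decoding loop below are illustrative assumptions, not the released implementation): a tiny captioner conditions on a clip's patch tokens, and each generated caption word contributes one soft token, so the number of compressed tokens tracks how much there is to say about the clip.

```python
import torch
import torch.nn as nn

class TinyCaptionerSketch(nn.Module):
    """Sketch: a lightweight 'language expert' describes a clip; the hidden state
    of each generated caption token doubles as one soft visual token, so richer
    clips (longer descriptions) keep more tokens and static clips keep fewer."""
    def __init__(self, vis_dim=1024, d=512, vocab=1000, max_len=64, bos_id=1, eos_id=0):
        super().__init__()
        self.proj = nn.Linear(vis_dim, d)                 # visual patches -> LM space
        self.emb = nn.Embedding(vocab, d)
        self.block = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.head = nn.Linear(d, vocab)
        self.max_len, self.bos_id, self.eos_id = max_len, bos_id, eos_id

    @torch.no_grad()
    def forward(self, clip_feats):                        # clip_feats: (1, N, vis_dim)
        seq = self.proj(clip_feats)                       # condition on the clip's patches
        tok = torch.tensor([self.bos_id])
        soft_tokens = []
        for _ in range(self.max_len):
            seq = torch.cat([seq, self.emb(tok)[None]], dim=1)
            h = self.block(seq)[:, -1]                    # hidden state of the newest word
            soft_tokens.append(h)                         # one soft caption token
            tok = self.head(h).argmax(-1)                 # greedy next-word choice
            if tok.item() == self.eos_id:                 # nothing more to say -> stop early
                break
        return torch.stack(soft_tokens, dim=1)            # (1, T, d); T varies per clip

compressor = TinyCaptionerSketch()
clip = torch.randn(1, 256, 1024)                          # 256 patch tokens for one clip
print(compressor(clip).shape)                             # e.g. torch.Size([1, T, 512])
```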
   
Contributions
- We propose LangDC, a novel language-aware token compression strategy. Using soft language tokens for visual representation, it adaptively adjusts compression ratios, improving token utilization over fixed-ratio techniques. 
- We propose semantic density-aware supervision for the token compressor. By explicitly providing reconstruction targets for token compression, we enable the derivation of a more compact feature set that is not only aware of information richness but also preserves key visual cues (see the hedged sketch after this list).
- Experimental results demonstrate that our method reduces FLOPs by 49% relative to the strong baseline VideoGPT+, while maintaining competitive performance. Additional qualitative results show adaptive compression based on video clip semantic density. 
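
As an illustration of the second contribution, here is a hedged sketch of what density-aware supervision could look like; the loss form, the teacher-forced alignment between soft tokens and the reference caption, and the weight `alpha` are assumptions for exposition, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def density_aware_loss(caption_logits, ref_caption_ids, answer_loss, alpha=1.0):
    """caption_logits: (T, vocab), one step per soft token emitted for a clip.
    ref_caption_ids: (T,) tokenized reference description, teacher-forced to the
    same length; it acts as the explicit reconstruction target for compression."""
    recon = F.cross_entropy(caption_logits, ref_caption_ids)  # reconstruct the description
    return answer_loss + alpha * recon                        # downstream loss + reconstruction
```

In this reading, longer reference descriptions both supply more reconstruction targets and license a larger token budget, which is how density awareness enters training.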
   
Installation
We recommend setting up a conda environment for the project:
conda create --name=langdc python=3.11
conda activate langdc
git clone https://github.com/NIneeeeeem/LangDC.git
cd LangDC
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.41.0
pip install -r requirements.txt
export PYTHONPATH="./:$PYTHONPATH"
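Optionally, sanity-check that the pinned versions resolved (the exact CUDA availability depends on your driver):
python -c "import torch, transformers; print(torch.__version__, torch.cuda.is_available(), transformers.__version__)"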
Additionally, install FlashAttention for training:
pip install ninja
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
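If the build succeeded, importing the package should print its version:
python -c "import flash_attn; print(flash_attn.__version__)"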
Quantitative Evaluation
We provide instructions to reproduce LangDC results on VideoMME, MVBench, LongVideoBench, VSIBench, and four open-ended QA benchmarks. Please follow the instructions at eval/README.md.
To reproduce the results in Table 1 of the Motivation section, please refer to this repository.
Citations
If you're using LangDC in your research or applications, please give us a star ⭐ to support us and cite using this BibTeX:
@misc{wang2025seeing,
    title={Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors},
    author={Xiangchen Wang and Jinrui Zhang and Teng Wang and Haigang Zhang and Feng Zheng},
    year={2025},
    eprint={2509.00969},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Acknowledgements
- Video-ChatGPT+: A pioneering attempt at video-based conversation models.
- LLaVA: Our codebase is built upon LLaVA and Video-ChatGPT+.
Model tree for Wangxc1000/LangDC
- Base model: Qwen/Qwen2.5-VL-3B-Instruct