AVoCaDO: An AudioVisual Video Captioner Driven by Temporal Orchestration
Overview
Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.
Getting Started
Please refer to our GitHub repository for more details.
Citation
If you find our work helpful for your research, please consider giving the repository a star and citing our paper. We appreciate your support!
@article{chen2025avocado,
title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
journal={arXiv preprint arXiv:2510.10395},
year={2025}
}
Model tree for AVoCaDO-Captioner/AVoCaDO
Base model: Qwen/Qwen2.5-Omni-7B