AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

✨ Overview

Audiovisual video captioning aims to generate semantically rich descriptions in which visual and auditory events are temporally aligned, which benefits both video understanding and video generation. We introduce AVoCaDO, an audiovisual video captioner driven by the temporal orchestration between the audio and visual modalities. Experimental results show that AVoCaDO significantly outperforms existing open-source models on four audiovisual video captioning benchmarks and achieves competitive performance in visual-only settings.

🚀 Getting Started

Please refer to our GitHub repository for more details.

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{chen2025avocado,
  title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
  author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
  journal={arXiv preprint arXiv:2510.10395},
  year={2025}
}

Model size: 9B params (Safetensors)

Tensor type: BF16

Model tree for AVoCaDO-Captioner/AVoCaDO

Base model: Qwen/Qwen2.5-Omni-7B (this model is a fine-tune of it)