AVoCaDO: An AudioVisual Video Captioner Driven by Temporal Orchestration

✨ Overview

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.

🚀 Getting Started

Please refer to our GitHub repository for more details.

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{chen2025avocado,
  title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
  author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
  journal={arXiv preprint arXiv:2510.10395},
  year={2025}
}