I want to train a fine-grained, timestamp-aligned True-Captioner

#4
by mifanbushipeicai - opened

The current model essentially outputs a sequential description of events rather than a temporally grounded one: because the Transformer architecture has no built-in notion of clock time, it does not natively emit timestamps.
I want to fine-tune a model through post-training so that its output is aligned with timestamps.
Such a model could handle many tasks, such as directly producing real-time subtitle files!
What training datasets and strategies would you suggest for achieving this?
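
To make the target concrete, here is a minimal sketch of the output I have in mind. It assumes the fine-tuned model would return a list of `(start_sec, end_sec, caption)` segments; that segment format and the helper names `format_timestamp` and `segments_to_srt` are my own placeholders, not something the current model provides. The sketch only covers the last step, turning such segments into an SRT subtitle file.

```python
from typing import List, Tuple

# Hypothetical model output: (start in seconds, end in seconds, caption text).
Segment = Tuple[float, float, str]


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Segment]) -> str:
    """Render timestamped caption segments as the text of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each SRT cue: running index, time range, caption text.
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 2.5, "A dog runs."), (2.5, 4.0, "It jumps.")]
    print(segments_to_srt(demo))
```

If the model could emit segments like `demo` reliably, writing subtitles would reduce to this formatting step, so the hard part is really the training data and the alignment objective.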
