I want to train a True-Captioner with fine-grained timestamp alignment
#4 opened by mifanbushipeicai
The current model essentially describes the order of events rather than when they happen. Because of inherent limitations of the Transformer architecture, its output does not natively include timestamps.
I want to fine-tune the model through post-training so that its output is aligned with timestamps.
Such a model could handle many tasks, such as directly producing real, time-aligned subtitle files!
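To make the target concrete, here is a rough sketch of the kind of output I have in mind and how it could be turned into an SRT subtitle file. Everything in it is an assumption for illustration: the `<|seconds|>` marker format is borrowed from Whisper-style timestamp tokens and is not something the current model emits, and `parse_segments`, `srt_time`, and `to_srt` are hypothetical helper names.

```python
import re

# ASSUMPTION: the fine-tuned model wraps each caption segment in
# Whisper-style markers such as "<|0.00|> text <|2.50|>". This format is
# hypothetical and only illustrates one possible training target.
TS_SEGMENT = re.compile(r"<\|(\d+(?:\.\d+)?)\|>(.*?)<\|(\d+(?:\.\d+)?)\|>", re.S)


def parse_segments(output):
    """Extract (start_seconds, end_seconds, text) tuples from model output."""
    return [
        (float(start), float(end), text.strip())
        for start, text, end in TS_SEGMENT.findall(output)
    ]


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT text: index, time range, caption, blank line."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = "<|0.00|> A man opens the door. <|2.50|><|2.50|> He sits down. <|5.04|>"
    print(to_srt(parse_segments(sample)))
```

With a target like this, the training data would need captions paired with start and end times for each segment, which is exactly the part I am unsure how to source.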
What suggestions do you have regarding the specific training datasets and strategies to achieve such a model?