I want to train a True-Captioner with fine-grained timestamp alignment

by mifanbushipeicai - opened Oct 28


The current model essentially describes the order of events rather than when they occur. Because of the inherent limitations of the Transformer architecture, its output does not natively include timestamps.
I want to post-train (fine-tune) the model so that its output is aligned with timestamps.
Such a model could handle many tasks, for example directly producing real subtitle files in real time!
What datasets and training strategies would you suggest for building such a model?
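To make the subtitle use case concrete, here is a minimal sketch of how timestamp-aligned output could be turned into an SRT file. It assumes the fine-tuned model yields `(start_seconds, end_seconds, text)` segments; that segment format and the function names `format_srt_time` and `segments_to_srt` are hypothetical, not something the current model provides.

```python
from typing import Iterable, Tuple

# Hypothetical segment format: (start_seconds, end_seconds, caption_text).
Segment = Tuple[float, float, str]


def format_srt_time(seconds: float) -> str:
    """Convert seconds to the SRT timestamp layout HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render timestamped caption segments as SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each SRT cue is: running index, time range, then the caption text.
        time_range = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{time_range}\n{text.strip()}")
    # Cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [(0.0, 1.5, "A door opens."), (1.5, 3.0, "Someone walks in.")]
    print(segments_to_srt(demo))
```

The same segment tuples could also serve as the target format for training data, since each caption is paired with an explicit start and end time.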
