About the inference efficiency

#3
by Abracdabra-H - opened

Hi there,

I am reading your work, which looks quite reasonable to me.
But I still have one question; it would be really nice if you could give me a hint.

You mentioned: "However, this approach imposes a significant burden on processing speed. For every frame, performing the slow, recurrent, and lengthy next-token prediction with a billion-scale language model makes it extremely hard to achieve real-time video streaming dialogue."

Instead of generating a short answer for every frame, you propose to predict "EOS", which you argue saves time.
My question is: your idea is essentially "do not predict a whole sentence for a useless frame, but predict exactly one token (EOS) for it; a single token does not need to go through the autoregressive decoding loop, which saves time". Am I right? So it is still a kind of frame-by-frame QA, just faster, since only one token is predicted for useless frames.

Thank you very much!

Yes, you are right! But note that the "frame-by-frame QA" has history (the KV cache) to reuse, so each frame's forward pass attends to everything seen so far instead of reprocessing the stream from scratch.
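
For anyone following along, here is a minimal sketch of the loop being discussed, as I understand it. It assumes a Hugging Face-style causal LM that accepts `inputs_embeds` and `past_key_values`; the helper name `stream_step`, the `EOS_ID` value, and the length cap are all hypothetical, not from the paper:

```python
import torch

EOS_ID = 2  # assumed EOS token id; depends on the actual tokenizer

@torch.no_grad()
def stream_step(model, frame_embeds, past_key_values):
    """One streaming step; frame_embeds is (1, n_frame_tokens, hidden).

    Runs a single forward pass per frame, reusing the KV cache built up
    over the whole stream. Only if the model does NOT predict EOS do we
    pay for the slow autoregressive decoding loop.
    """
    out = model(inputs_embeds=frame_embeds,
                past_key_values=past_key_values,
                use_cache=True)
    next_id = out.logits[:, -1].argmax(-1)   # greedy next-token choice
    past_key_values = out.past_key_values    # history grows with every frame

    if next_id.item() == EOS_ID:
        # "Useless" frame: exactly one predicted token, no decoding loop.
        return None, past_key_values

    # Frame worth answering: fall back to autoregressive decoding,
    # still starting from the shared KV cache.
    generated = [next_id.item()]
    while generated[-1] != EOS_ID and len(generated) < 64:  # arbitrary cap
        out = model(input_ids=next_id.unsqueeze(0),
                    past_key_values=past_key_values,
                    use_cache=True)
        next_id = out.logits[:, -1].argmax(-1)
        past_key_values = out.past_key_values
        generated.append(next_id.item())
    return generated, past_key_values
```

The point of the sketch: the per-frame cost for a silent frame is one forward pass over that frame's tokens (the EOS check), while the KV cache carries the conversation and video history across frames for free.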
