About the inference efficiency

#3
by Abracdabra-H - opened

Hi there,

I am reading your work, which looks quite reasonable to me.
But I still have one question; it would be really nice if you could give me a hint.

You mentioned: "However, this approach imposes a significant burden on processing speed. For every frame, performing the slow, recurrent, and lengthy next-token prediction with a billion-scale language model makes it extremely hard to achieve real-time video streaming dialogue."

Instead of generating a short answer for every frame, you propose to predict "EOS", which you argue saves time.
My question is: your idea is essentially "do not predict a whole sentence for a useless frame, but predict exactly one token (EOS) for it; a single token does not need to go through the autoregressive decoding loop, which saves time". Am I right? So it is still a kind of frame-by-frame QA, just faster, since only one token is predicted for useless frames.

Thank you very much!

Yes, you are right! But note that the "frame-by-frame QA" has history (the KV cache) to reuse, so each frame's forward pass attends to everything seen so far instead of reprocessing the stream from scratch.
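
For anyone following along, here is a minimal sketch of the loop being discussed, as I understand it. It assumes a Hugging Face-style causal LM that accepts `inputs_embeds` and `past_key_values`; the helper name `stream_step`, the `EOS_ID` value, and the length cap are all hypothetical, not from the paper:

```python
import torch

EOS_ID = 2  # assumed EOS token id; depends on the actual tokenizer

@torch.no_grad()
def stream_step(model, frame_embeds, past_key_values):
    """One streaming step; frame_embeds is (1, n_frame_tokens, hidden).

    Runs a single forward pass per frame, reusing the KV cache built up
    over the whole stream. Only if the model does NOT predict EOS do we
    pay for the slow autoregressive decoding loop.
    """
    out = model(inputs_embeds=frame_embeds,
                past_key_values=past_key_values,
                use_cache=True)
    next_id = out.logits[:, -1].argmax(-1)   # greedy next-token choice
    past_key_values = out.past_key_values    # history grows with every frame

    if next_id.item() == EOS_ID:
        # "Useless" frame: exactly one predicted token, no decoding loop.
        return None, past_key_values

    # Frame worth answering: fall back to autoregressive decoding,
    # still starting from the shared KV cache.
    generated = [next_id.item()]
    while generated[-1] != EOS_ID and len(generated) < 64:  # arbitrary cap
        out = model(input_ids=next_id.unsqueeze(0),
                    past_key_values=past_key_values,
                    use_cache=True)
        next_id = out.logits[:, -1].argmax(-1)
        past_key_values = out.past_key_values
        generated.append(next_id.item())
    return generated, past_key_values
```

The point of the sketch: the per-frame cost for a silent frame is one forward pass over that frame's tokens (the EOS check), while the KV cache carries the conversation and video history across frames for free.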
