Why does a frame extraction strategy require such large model weights?

#2
by buildLLM - opened

Frame extraction is a small part of video understanding and should be as lightweight as possible.

You're right. Large MLLMs offer stronger multimodal capabilities and can better identify key frames from long interleaved context, but efficiency is also crucial for practical frame extraction. It's ultimately a trade-off between effectiveness and efficiency.

Fortunately, our generative frame sampler is designed to be flexible and supports different SOTA MLLMs. For example, we trained a Qwen2.5-VL-3B-based sampler on low-resolution input (112×112), which is more efficient while still outperforming existing samplers in effectiveness (Table 8).

In addition, the heavier Aria-based sampler can be combined with a CLIP-based sampler in a hybrid approach (Table 7), which gives a better balance between efficiency and effectiveness.
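To make the hybrid idea concrete, here is a minimal sketch of one common way to combine a cheap and an expensive scorer: a two-stage cascade. This is an illustration under assumptions, not necessarily how the hybrid in Table 7 is built. The function `hybrid_sample` and the two scoring callables are hypothetical stand-ins (for example, `cheap_score` could wrap a CLIP text-frame similarity and `heavy_score` could wrap an MLLM-based sampler), and no real model is loaded here.

```python
from typing import Callable, List


def hybrid_sample(
    num_frames: int,
    cheap_score: Callable[[int], float],
    heavy_score: Callable[[int], float],
    num_candidates: int,
    num_selected: int,
) -> List[int]:
    """Two-stage frame selection (hypothetical sketch, not the authors' method).

    Stage 1 scores every frame with the cheap scorer and keeps the top
    `num_candidates`. Stage 2 runs the heavy scorer only on those candidates
    and keeps the top `num_selected`. Returns frame indices in temporal order.
    """
    if not 0 < num_selected <= num_candidates <= num_frames:
        raise ValueError("need 0 < num_selected <= num_candidates <= num_frames")

    # Stage 1: the cheap scorer sees all frames.
    candidates = sorted(range(num_frames), key=cheap_score, reverse=True)
    candidates = candidates[:num_candidates]

    # Stage 2: the heavy scorer sees only the surviving candidates.
    selected = sorted(candidates, key=heavy_score, reverse=True)[:num_selected]

    # Restore temporal order for the downstream video model.
    return sorted(selected)


if __name__ == "__main__":
    # Toy per-frame scores standing in for real model outputs.
    cheap = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]
    heavy = [0.0, 0.2, 0.9, 1.0, 0.8, 0.0]
    picked = hybrid_sample(6, cheap.__getitem__, heavy.__getitem__, 3, 2)
    print(picked)
```

In this toy run the heavy scorer is called on 3 frames instead of 6. Frame 3 has the highest heavy score but is dropped in stage 1, which is exactly the efficiency-versus-effectiveness trade-off discussed above: a larger `num_candidates` recovers more such frames at higher cost.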

