Why does a frame extraction strategy require such a heavyweight model?
Frame extraction is a small part of video understanding and should be as lightweight as possible.
You're right. Large MLLMs offer stronger multimodal capabilities and can better identify key frames in long interleaved contexts, but efficiency is also crucial for practical frame extraction. Ultimately, it is a trade-off between effectiveness and efficiency.
Fortunately, our generative frame sampler is a flexible design that supports different SOTA MLLMs. For example, we trained a Qwen2.5-VL-3B-based sampler on low-resolution input (112×112), which achieves better efficiency while still outperforming existing samplers in effectiveness (Table 8).
In addition, the heavier Aria-based sampler can be combined with a CLIP-based sampler in a hybrid approach (Table 7), which offers a better balance between efficiency and effectiveness.
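
To make the hybrid idea concrete, here is a minimal sketch of one plausible two-stage arrangement: a cheap scorer prefilters frames, and the heavy sampler re-ranks only the survivors. This is an illustrative assumption, not necessarily the exact configuration evaluated in Table 7. The function `hybrid_sample`, its parameters, and the stand-in scorers are all hypothetical names introduced for this example.

```python
from typing import Callable, List, Sequence


def hybrid_sample(
    cheap_scores: Sequence[float],
    heavy_scorer: Callable[[List[int]], List[float]],
    num_candidates: int,
    num_selected: int,
) -> List[int]:
    """Two-stage frame selection (hypothetical sketch).

    cheap_scores: one relevance score per frame from a lightweight model
        (e.g. a CLIP-style similarity; assumed to be precomputed).
    heavy_scorer: callable that scores a list of frame indices with the
        expensive model (a stand-in for an MLLM-based sampler).
    """
    # Stage 1: keep the top `num_candidates` frames by the cheap score.
    order = sorted(
        range(len(cheap_scores)), key=lambda i: cheap_scores[i], reverse=True
    )
    candidates = sorted(order[:num_candidates])  # restore temporal order

    # Stage 2: run the expensive scorer on the reduced candidate set only.
    heavy_scores = heavy_scorer(candidates)
    if len(heavy_scores) != len(candidates):
        raise ValueError("heavy_scorer must return one score per candidate")

    ranked = sorted(
        zip(candidates, heavy_scores), key=lambda pair: pair[1], reverse=True
    )
    # Return the selected frame indices in temporal order.
    return sorted(index for index, _ in ranked[:num_selected])
```

In this sketch the heavy model sees only `num_candidates` frames rather than the whole video, so that parameter is the knob that trades efficiency against effectiveness.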
