Why does a frame extraction strategy require such large weights?

by buildLLM - opened Jul 3, 2025

Discussion

buildLLM

Jul 3, 2025

Frame extraction is a small part of video understanding and should be as lightweight as possible.

yaolily

Owner Jul 7, 2025

You're right. Large MLLMs offer stronger multimodal capabilities and can better identify key frames given long interleaved context, but efficiency is also crucial for practical frame extraction. It's ultimately a trade-off between effectiveness and efficiency.

Fortunately, our generative frame sampler is flexible by design and supports different SOTA MLLMs. For example, we trained a Qwen2.5-VL-3B-based sampler on low-resolution input (112×112), which is more efficient while still outperforming existing samplers in effectiveness (Table 8).

In addition, the heavier Aria-based sampler can be combined with a CLIP-based sampler in a hybrid approach (Table 7), which gives a better balance between efficiency and effectiveness.
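To make the hybrid idea concrete, here is a minimal sketch of one common way to combine a cheap scorer with an expensive one: the cheap scorer shortlists candidate frames, and the heavy scorer re-ranks only that shortlist. This is an assumption for illustration, not necessarily the combination evaluated in Table 7. The names `hybrid_sample`, `cheap_score`, `heavy_score`, and `shortlist_size` are hypothetical, and the two scorers are plain callables standing in for a CLIP-style and an MLLM-style relevance score.

```python
from typing import Callable, List


def hybrid_sample(
    num_frames: int,
    cheap_score: Callable[[int], float],   # hypothetical stand-in for a CLIP-style scorer
    heavy_score: Callable[[int], float],   # hypothetical stand-in for an MLLM-style scorer
    shortlist_size: int,
    k: int,
) -> List[int]:
    """Two-stage selection: shortlist with the cheap scorer, re-rank with the heavy one."""
    if not 0 < k <= shortlist_size <= num_frames:
        raise ValueError("need 0 < k <= shortlist_size <= num_frames")

    # Stage 1: score every frame with the cheap scorer and keep the best candidates.
    shortlist = sorted(range(num_frames), key=cheap_score, reverse=True)[:shortlist_size]

    # Stage 2: run the heavy scorer only on the shortlist and keep the top k.
    selected = sorted(shortlist, key=heavy_score, reverse=True)[:k]

    # Return the chosen frame indices in temporal order.
    return sorted(selected)


# Toy scores for six frames (made-up numbers, for illustration only).
cheap = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]
heavy = [0.0, 0.2, 0.9, 0.0, 0.8, 1.0]

heavy_calls = 0


def counted_heavy(i: int) -> float:
    """Wrap the heavy scorer so the number of expensive calls can be counted."""
    global heavy_calls
    heavy_calls += 1
    return heavy[i]


picked = hybrid_sample(6, lambda i: cheap[i], counted_heavy, shortlist_size=3, k=2)
print(picked, heavy_calls)
```

In this toy run the heavy scorer is called on 3 frames instead of 6. The cost is that frame 5, which the heavy scorer would have ranked first, never reaches it because the cheap scorer filtered it out. That is the efficiency/effectiveness trade-off in miniature, and `shortlist_size` is the knob that controls it.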
