Papers
arxiv:2604.08120

Small Vision-Language Models are Smart Compressors for Long Video Understanding

Published on Apr 9
· Submitted by Junjie Fei on Apr 10
Authors:

Abstract

Tempo is an efficient framework that compresses long videos for multimodal understanding by using a small vision-language model for temporal compression and adaptive token allocation to maintain intent-aligned representations within strict budgets.

AI-generated summary

Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, such as sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process that generates compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router: it allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors that preserve the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101 s), Tempo scores 52.3 under a strict 8K visual-token budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling to 2048 frames raises the score to 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, showing that true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
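The budgeted allocation idea behind ATA can be illustrated with a minimal sketch. The function name, the proportional split, and the anchor/cap values below are all illustrative assumptions, not the paper's implementation; the only inputs taken from the abstract are a per-segment relevance score (as the SVLM's zero-shot prior would provide), a strict total budget, and a per-frame range of roughly 0.5-16 tokens.

```python
def allocate_tokens(relevance, budget, min_tokens=1, max_tokens=16):
    """Hypothetical budgeted allocator: split `budget` tokens across video
    segments proportionally to relevance, with a floor (a minimal temporal
    anchor per segment) and a per-segment cap."""
    n = len(relevance)
    # Every segment keeps at least a minimal anchor to preserve the storyline.
    alloc = [min_tokens] * n
    remaining = budget - min_tokens * n
    if remaining <= 0:
        return alloc
    total = sum(relevance) or 1.0
    # Proportional split of the leftover budget, capped per segment so one
    # highly relevant segment cannot monopolize the context window.
    for i, r in enumerate(relevance):
        extra = min(int(remaining * r / total), max_tokens - min_tokens)
        alloc[i] += extra
    return alloc

# Example: segments 1 and 3 are query-critical, the rest are background.
scores = [0.05, 0.9, 0.1, 0.7, 0.02]
print(allocate_tokens(scores, budget=24))
```

Note that the allocator is a single pass over precomputed scores, which is what makes a router like this effectively free relative to the MLLM forward pass.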

Community

Paper author and submitter

🔥 Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding

How do we make MLLMs understand hour-long videos without saturating context windows? Tempo uses an SVLM to actively filter and compress videos via query-aware cross-modal distillation in a single forward pass!

๐Ÿ† SOTA Performance: Outperforms other long video MLLMs on the extreme-long LVBench (52.3 at 8K budget).

Everything is open-sourced! Try it out:


Get this paper in your agent:

hf papers read 2604.08120
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 1

Collections including this paper 0
