arxiv:2510.07915

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Published on Oct 9

Authors:

Peiran Wu

AI-generated summary

MARC, a memory-augmented, reinforcement learning-based token compression method, cuts the computational cost of video understanding by compressing visual tokens without significant performance loss.

Abstract

The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos, because high frame rates and long durations multiply the number of visual tokens. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy: a Visual Memory Retriever (VMR) selects key clips, and a Compression Group Relative Policy Optimization (C-GRPO) framework distills reasoning ability from a teacher model to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
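The abstract does not spell out how C-GRPO computes its training signal, so the following is only a minimal sketch of the group-relative advantage normalization used in standard GRPO, on which C-GRPO appears to build. The function name, the example rewards, the `eps` constant, and the choice of population standard deviation are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each sampled response against its own group.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no separate value network is needed.
    NOTE: hypothetical helper; population std and eps are assumptions.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four student responses to the same video question
# (for example, 1.0 if the compressed-token answer is correct, 0.0 otherwise).
example_rewards = [1.0, 0.0, 1.0, 0.0]
example_advantages = group_relative_advantages(example_rewards)
print(example_advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one; how MARC shapes the reward itself (for instance, how agreement with the teacher is scored) is not described in the abstract and is left out here.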
