Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
Abstract
AwaRes is a spatial-on-demand framework for vision-language models that dynamically retrieves high-resolution image segments based on query needs, using tool-calling and multi-turn reinforcement learning with composite rewards.
Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs promote efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
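The composite reward described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the exact-match correctness check, the `lambda_cost` weight, and the linear per-crop penalty are hypothetical stand-ins for the paper's semantic judge and its actual cost formulation.

```python
def composite_reward(pred_answer: str, gold_answer: str,
                     num_crops: int, lambda_cost: float = 0.1) -> float:
    """Toy composite reward: answer correctness minus a per-crop cost penalty.

    The paper scores semantic answer correctness with a judge; here we use a
    simple normalized exact match purely for illustration.
    """
    correctness = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    # Explicit crop-cost penalty: each retrieved high-resolution crop reduces
    # the reward, pushing the policy to crop only when it helps the answer.
    return correctness - lambda_cost * num_crops

# A trajectory that answers correctly with fewer crops scores higher,
# so the policy is rewarded for retrieving high-resolution detail sparingly.
assert composite_reward("Paris", "paris", num_crops=1) > \
       composite_reward("Paris", "paris", num_crops=3)
```

The design intent is that an incorrect answer is never rescued by skipping crops, while a correct answer earns more when it uses less high-resolution compute.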
Community
AwaRes provides a practical, deployment-friendly path to high-detail VLM reasoning under tight compute budgets.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery (2026)
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning (2026)
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding (2026)
- ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models (2026)
- Towards Pixel-Level VLM Perception via Simple Points Prediction (2026)
- Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning (2026)
- GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation (2026)
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
The real punchline here is the coupled decision policy: first ask whether higher resolution is needed, then select a discrete crop, all while staying KV-cache friendly. The supervision pipeline (a Llama judge plus an oracle grounding model, built automatically) feels brittle to domain shift; I wonder how robust it is when the query distribution shifts or the text cues are noisy. I'd love to see an ablation that replaces the judge/oracle with a single end-to-end gating module predicting need and crop in one shot, comparing accuracy and latency. Also, how sensitive is the method to the granularity of the predefined crop set? With only a handful of crops, you might miss small but critical cues in dense charts. BTW, the arxivlens breakdown helped me parse the method details and covers this well: https://arxivlens.com/PaperView/Details/look-where-it-matters-high-resolution-crops-retrieval-for-efficient-vlms-6568-8d3b20ce