Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
Abstract
AwaRes is a spatial-on-demand framework for vision-language models that dynamically retrieves high-resolution image segments based on query needs, using tool-calling and multi-turn reinforcement learning with composite rewards.
Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs promote efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
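The composite reward described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the exact-match correctness check, the `lambda_cost` weight, and the linear per-crop penalty are hypothetical stand-ins for the paper's semantic judge and its actual cost formulation.

```python
def composite_reward(pred_answer: str, gold_answer: str,
                     num_crops: int, lambda_cost: float = 0.1) -> float:
    """Toy composite reward: answer correctness minus a per-crop cost penalty.

    The paper scores semantic answer correctness with a judge; here we use a
    simple normalized exact match purely for illustration.
    """
    correctness = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    # Explicit crop-cost penalty: each retrieved high-resolution crop reduces
    # the reward, pushing the policy to crop only when it helps the answer.
    return correctness - lambda_cost * num_crops

# A trajectory that answers correctly with fewer crops scores higher,
# so the policy is rewarded for retrieving high-resolution detail sparingly.
assert composite_reward("Paris", "paris", num_crops=1) > \
       composite_reward("Paris", "paris", num_crops=3)
```

The design intent is that an incorrect answer is never rescued by skipping crops, while a correct answer earns more when it uses less high-resolution compute.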
Community
AwaRes provides a practical, deployment-friendly path to high-detail VLM reasoning under tight compute budgets.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery (2026)
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning (2026)
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding (2026)
- ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models (2026)
- Towards Pixel-Level VLM Perception via Simple Points Prediction (2026)
- Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning (2026)
- GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation (2026)
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
The real punchline here is the coupled decision policy: first ask whether higher resolution is needed, then select a discrete crop, all while staying KV-cache friendly. The supervision pipeline (a Llama judge plus an oracle grounding model, built automatically) feels brittle to domain shift; I wonder how robust it is when the query distribution shifts or the text cues are noisy. I'd love to see an ablation that replaces the judge/oracle with a single end-to-end gating module predicting need and crop in one shot, comparing accuracy and latency. Also, how sensitive is the method to the granularity of the predefined crop set? With only a handful of crops, you might miss small but critical cues in dense charts. BTW, the arxivlens breakdown helped me parse the method details and covers this well: https://arxivlens.com/PaperView/Details/look-where-it-matters-high-resolution-crops-retrieval-for-efficient-vlms-6568-8d3b20ce