Beyond Single and Earthbound: Advancing Multi-image Grounding in Remote Sensing with Large Vision-Language Models
Abstract: The rapid advancement of Large Vision-Language Models (LVLMs) has showcased impressive perception and reasoning capabilities on visual-linguistic content. However, their full potential remains largely confined to an earthbound perspective, leading to unsatisfactory performance on fine-grained remote sensing image (RSI) interpretation due to the inherent domain gap. Although recent research has been dedicated to fine-tuning LVLMs on domain-specific datasets, yielding notable improvements in tasks such as visual grounding, the restriction to single-image input and the limited diversity of task categories continue to impede their practicality in remote sensing applications. To address these challenges, we propose to explore and advance the multi-image grounding (MIG) capability of LVLMs from a bird's-eye view, enabling more robust support for tasks such as change detection, object tracking, referring expression comprehension, and cross-view retrieval in remote sensing. To this end, we introduce MIGRANT, the first LVLM tailored for diverse remote sensing tasks requiring MIG, enabling fine-grained interpretation across RSIs. We employ a two-stage training paradigm on a hierarchically constructed instruction-following dataset, MIG-RS-Instruct, which comprises two primary categories, Implicit Grounding and Explicit Grounding, and is divided into five leaf tasks in total. Given the absence of an established evaluation mechanism for remote sensing MIG, we also propose the first benchmark, MIG-RS-Bench, consisting of 4,000 cases. Extensive experiments demonstrate that MIGRANT achieves the best performance across all tasks in MIG-RS-Bench, and that the additional fine-tuning on MIG further enhances the model's visual grounding capability and generalizability, yielding an approximately 15% improvement on mainstream single-image grounding benchmarks. We expect MIGRANT to serve as a valuable resource and provide insights for the future development of remote sensing MIG. Code and dataset are available at https://github.com/ShawnAn-WHU/MIGRANT.