| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - vision-language |
| - vlm |
| - grpo |
| - earthmind |
| - geospatial |
| - remote-sensing |
| library_name: transformers |
| pipeline_tag: image-text-to-text |
| --- |
| |
| # EarthMind-R1 |
|
|
| EarthMind-R1 is a vision-language model fine-tuned using GRPO (Group Relative Policy Optimization) for geospatial and remote sensing image understanding tasks. |
|
|
| ## Model Description |
|
|
| - **Base Model:** EarthMind-4B |
| - **Training Method:** GRPO (Group Relative Policy Optimization) |
| - **Training Data:** Geospatial instruction dataset |
| - **Fine-tuning:** LoRA adapters merged into base weights |
|
|
| ## Usage |
|
|
| ### Quick Start |
|
|
| ```python |
| import torch |
| from PIL import Image |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| # Load model and tokenizer |
| model_id = "aadex/Earthmind-R1" |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_id, |
|     trust_remote_code=True, |
|     torch_dtype=torch.bfloat16, |
|     device_map="auto", |
| ) |
| |
| # Load an image |
| image = Image.open("your_image.jpg").convert("RGB") |
| |
| # Ask a question |
| question = "Describe what you see in this satellite image." |
| |
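| # Use the model's chat interface. `chat` is not a standard transformers method; |
| # it comes from the model's remote code, hence trust_remote_code=True above. |
| response = model.chat( |
|     tokenizer=tokenizer, |
|     question=question, |
|     images=[image], |
|     generation_config={ |
|         "max_new_tokens": 512, |
|         "temperature": 0.7, |
|         "do_sample": True, |
|     }, |
| ) |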
| |
| print(response) |
| ``` |
|
|
| ### Expected Output Format |
|
|
| The model is trained to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags: |
|
|
| ``` |
| <think> |
| [Reasoning about the image content] |
| </think> |
| <answer> |
| [Final answer to the question] |
| </answer> |
| ``` |
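
Downstream code usually needs only the final answer, so it can help to split the response into its two parts. The following is a minimal sketch, assuming the response is a plain string that follows the format above. `parse_response` and `ParsedResponse` are hypothetical names introduced here for illustration and are not part of the model's API.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper, not part of the model's API.
# Matches a <think> block followed by an <answer> block, across newlines.
_RESPONSE_PATTERN = re.compile(
    r"<think>\s*(?P<think>.*?)\s*</think>\s*<answer>\s*(?P<answer>.*?)\s*</answer>",
    re.DOTALL,
)


class ParsedResponse(NamedTuple):
    reasoning: str
    answer: str


def parse_response(text: str) -> Optional[ParsedResponse]:
    """Return the reasoning and answer, or None if the tags are missing."""
    match = _RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return ParsedResponse(match.group("think"), match.group("answer"))


# Made-up example response, for illustration only.
example = (
    "<think>\nRunways and terminals are visible.\n</think>\n"
    "<answer>\nAn airport.\n</answer>"
)
parsed = parse_response(example)
print(parsed.answer)
```

Returning `None` for malformed output, rather than raising, lets the caller decide whether to retry generation or fall back to the raw text.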
|
|
| ## Requirements |
|
|
| ``` |
| torch>=2.0 |
| transformers>=4.40 |
| accelerate |
| pillow |
| ``` |
|
|
| ## Hardware Requirements |
|
|
| - **Minimum:** 16 GB VRAM (with bfloat16) |
| - **Recommended:** 24 GB VRAM for comfortable inference |
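
These figures are consistent with a rough estimate of the memory taken by the weights alone. The sketch below assumes about 4 billion parameters, which is inferred from the base model's name (EarthMind-4B) and not stated in this card. bfloat16 stores each parameter in 2 bytes, and the remaining VRAM goes to activations, the key-value cache, and image features.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB (2 bytes/param for bfloat16)."""
    return num_params * bytes_per_param / 1024**3


# Assumption: ~4e9 parameters, inferred from the name "EarthMind-4B".
ASSUMED_PARAMS = 4e9

weights_gib = weight_memory_gib(ASSUMED_PARAMS)
print(f"bfloat16 weights: {weights_gib:.1f} GiB")  # roughly 7.5 GiB
```

Under the same assumption, float32 weights would need twice as much, which is why loading in bfloat16 is suggested for a 16 GB card.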
|
|
| ## Training Details |
|
|
| - **Framework:** VLM-R1 + TRL |
| - **Optimizer:** AdamW |
| - **Learning Rate:** 1e-6 |
| - **LoRA Configuration:** |
|   - r: 32 |
|   - alpha: 64 |
|   - dropout: 0.05 |
| - **GRPO Settings:** |
|   - num_generations: 4 |
|   - num_iterations: 2 |
|   - beta: 0.01 |
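
To illustrate what "group relative" means with `num_generations: 4`: GRPO samples several completions per prompt, scores each one, and normalizes every reward against the mean and standard deviation of its own group. The following is a minimal sketch of that normalization step only, not the training code used for this model. The reward values and the `eps` constant are made up for illustration, the sketch uses the population standard deviation (implementations differ on such details), and the KL penalty weighted by `beta` is not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for the 4 completions sampled for one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's average get a positive advantage and the rest a negative one. If all four completions receive the same reward, every advantage is zero and that prompt contributes no learning signal.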
|
|
| ## Limitations |
|
|
| - Optimized for geospatial and remote sensing imagery |
| - May perform worse on general-domain images |
| - Response quality depends on image resolution and clarity |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{earthmind-r1, |
| title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding}, |
| author={Your Name}, |
| year={2024}, |
| publisher={HuggingFace} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 |
|
|