Improve model card: Add pipeline tag, library, paper, code links and detailed usage
This PR significantly enhances the model card for the `Visurf-7B-Best-on-gRefCOCO` model by:
- Adding `library_name: transformers` to enable automated code snippets for the Hugging Face `transformers` library, as evidenced by the existing usage example and `config.json`.
- Adding `pipeline_tag: image-text-to-text` for better model discoverability on the Hugging Face Hub, reflecting its nature as a Large Vision-and-Language Model.
- Including a link to the paper: [ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models](https://huggingface.co/papers/2510.10606).
- Adding a link to the official GitHub repository for code and further resources: https://github.com/dvlab-research/ViSurf.
- Populating the model card with a comprehensive overview (including the abstract and diagram), detailed installation instructions, inference examples, evaluation and training guidelines, and other relevant information taken directly from the project's GitHub README. This provides rich, user-friendly documentation for the model.
Please review these additions and merge this PR.

The updated `README.md` is shown below. The YAML front matter keeps the existing tags and adds `library_name` and `pipeline_tag`:

```yaml
tags:
- multimodal
- qwen
- visurf
library_name: transformers
pipeline_tag: image-text-to-text
---
```

# ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

This repository contains the model presented in the paper [**ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models**](https://huggingface.co/papers/2510.10606).

**GitHub Repository**: https://github.com/dvlab-research/ViSurf

## Abstract

Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (**Vi**sual **Su**pervised-and-**R**einforcement **F**ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming individual SFT, individual RLVR, and two-stage SFT → RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

## Overview of ViSurf

<div align=center>
<img width="98%" src="https://github.com/dvlab-research/ViSurf/raw/main/assets/overview.png"/>
</div>

ViSurf (**Vi**sual **Su**pervised-and-**R**einforcement **F**ine-Tuning) is a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage.

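The core idea from the abstract, injecting ground-truth labels into the RLVR rollouts so that a single update carries both external supervision and internal reinforcement, can be sketched as a toy example. Everything below (names, reward values, the group-relative normalization) is purely illustrative and is not the actual ViSurf implementation or its reward control strategies.

```python
# Toy sketch of the ViSurf idea: mix the ground-truth label into the rollout group,
# so the policy update sees both sampled responses (internal reinforcement) and the
# ground truth (external supervision). All names here are illustrative, not the repo's API.
from dataclasses import dataclass

@dataclass
class Rollout:
    response: str
    reward: float
    is_ground_truth: bool = False

def build_visurf_group(policy_samples, rewards, ground_truth, gt_reward=1.0):
    """Form one training group: sampled rollouts plus the injected ground-truth label."""
    group = [Rollout(resp, rew) for resp, rew in zip(policy_samples, rewards)]
    group.append(Rollout(ground_truth, gt_reward, is_ground_truth=True))
    return group

def group_advantages(group):
    """Group-relative advantages over the mixed group (simplified, mean-centred only)."""
    mean_reward = sum(r.reward for r in group) / len(group)
    return [r.reward - mean_reward for r in group]

if __name__ == "__main__":
    samples = ["<answer>box A</answer>", "<answer>no object</answer>"]  # sampled rollouts
    rewards = [0.0, 0.4]                                                # verifiable rewards
    group = build_visurf_group(samples, rewards, ground_truth="<answer>box B</answer>")
    for rollout, adv in zip(group, group_advantages(group)):
        tag = "GT " if rollout.is_ground_truth else "gen"
        print(f"{tag} reward={rollout.reward:.2f} advantage={adv:+.2f}")
```
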
## Basic Usage with Transformers

This section demonstrates how to load the model using the Hugging Face `transformers` library.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Ricky06662/Visurf-7B-Best-on-gRefCOCO"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
```

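Because the checkpoint is fine-tuned from Qwen2.5-VL-7B-Instruct (see the Training section below), image-plus-text inference presumably follows the standard Qwen2.5-VL pattern with `AutoProcessor`. The snippet below is an illustrative sketch under that assumption, not the project's official inference path; for that, use `inference_scripts/inference_visurf.py` as described under Inference. The image path is a placeholder.

```python
# Hedged sketch: assumes the checkpoint loads with the standard Qwen2.5-VL classes.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_name = "Ricky06662/Visurf-7B-Best-on-gRefCOCO"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("your_image.jpg")  # placeholder path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "I want to rest, where should I sit?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```
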
## News

[Oct. 12th, 2025] 🔥 ViSurf is coming! We have released the code and training data.

## Contents

- [Installation](#installation)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Training](#training)
- [Build Your Data](#build-your-own-training-data-optional)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)

## Installation

```bash
git clone https://github.com/dvlab-research/ViSurf.git
cd ViSurf
conda create -n visionreasoner python=3.12
conda activate visionreasoner
pip install -e .
```

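After installation, a quick sanity check of the environment (assuming the editable install pulls in `torch` and `transformers` as dependencies) can be:

```bash
# Optional sanity check of the installed environment
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"
```
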
## Inference

Download the pretrained model using the following script:
```bash
mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Ricky06662/Visurf-7B-Best-on-gRefCOCO
```

> [!TIP]
> If you encounter issues with connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.

Then run inference using:
```bash
python inference_scripts/inference_visurf.py
```
The default question is
> "I want to rest, where should I sit?"

You will get the thinking process in the command line, like:

> "The question seems to be asking where to sit, but the image only shows a kitchen counter with food and flowers."

And the mask will be saved in the **inference_scripts** folder. In this case, there is no related object.

<div align=center>
<img width="98%" src="https://github.com/dvlab-research/ViSurf/raw/main/assets/test_output_1.png"/>
</div>

You can also try finding objects in the image by running:
```bash
python inference_scripts/inference_visurf.py --text "I want to cook food, what can I use?"
```

You will get the thinking process in the command line, like:

> "The question asks what kitchen tools or ingredients are visible that could be used for cooking."

The mask will be saved in the **inference_scripts** folder.

<div align=center>
<img width="98%" src="https://github.com/dvlab-research/ViSurf/raw/main/assets/test_output_2.png"/>
</div>

You can also provide your own `image_path` and text by:
```bash
python inference_scripts/inference_visurf.py --image_path "your_image_path" --text "your question text"
```

## Evaluation

Evaluation Data: [🤗 gRefCOCO val](https://huggingface.co/datasets/Ricky06662/grefcoco_val_all)

We recommend using [VisionReasoner](https://github.com/dvlab-research/VisionReasoner) to evaluate ViSurf.

> [!NOTE]
> In ViSurf, the best results on different benchmarks are obtained with different checkpoints, and we only release the best checkpoint on gRefCOCO. If you care about the exact numbers, we suggest evaluating and comparing the values in your own environment.

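If you want the evaluation split available locally before running the VisionReasoner evaluation, one option (a standard Hub download, not a documented step of this repository) is:

```bash
# Optional: fetch the gRefCOCO val data locally (requires huggingface_hub)
huggingface-cli download Ricky06662/grefcoco_val_all --repo-type dataset --local-dir data/grefcoco_val_all
```
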
## Training

### 1. ViSurf Training

Training Data: [🤗 ViSurf 7300](https://huggingface.co/datasets/Ricky06662/ViSurf_multi_non_object_7300_size840)

Download the dataset using this script:
```bash
python training_scripts/download_dataset.py
```

> [!TIP]
> Try resizing the images and re-calculating the corresponding bbox/point coordinates if you have less GPU memory (see the sketch below). Remember to change the corresponding resize_size in evaluation and inference.

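As a concrete illustration of that tip, here is a generic sketch (using PIL, not a script from this repository) that resizes an image and rescales a bbox and a point with it. The (x1, y1, x2, y2) pixel convention and the 840 target size (taken from the dataset name) are assumptions.

```python
# Generic sketch: resize an image and rescale pixel-space bbox/point annotations to match.
from PIL import Image

def resize_with_annotations(image_path, bbox, point, target_size=(840, 840)):
    """bbox = (x1, y1, x2, y2), point = (x, y), all in pixels of the original image."""
    img = Image.open(image_path)
    w, h = img.size
    new_w, new_h = target_size
    sx, sy = new_w / w, new_h / h

    resized = img.resize((new_w, new_h))
    new_bbox = (bbox[0] * sx, bbox[1] * sy, bbox[2] * sx, bbox[3] * sy)
    new_point = (point[0] * sx, point[1] * sy)
    return resized, new_bbox, new_point

# Example (hypothetical path and coordinates):
# resized, bbox, point = resize_with_annotations("image.jpg", (10, 20, 200, 240), (105, 130))
```
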
Download the pretrained base model using the following script:
```bash
mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
```

(Optional) Start Ray in advance by:
```bash
ray start --head   # or: ray start --head --port xxxx
```

Start training using this script:
```bash
bash training_scripts/qwen2_5vl_visurf_nonobj_7300.sh
```

You can try changing the following hyper-parameters if you have large GPU memory:
```bash
worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \
```
If your GPU has less memory, you can change the following config options; the exact numbers depend on your GPU memory (see the example below):
```bash
worker.rollout.tensor_parallel_size=[your number between 1-4]
worker.rollout.gpu_memory_utilization=[your number between 0-1]
worker.rollout.n=[your number between 2-32]
```

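For example, a reduced-memory configuration could set the corresponding values inside `training_scripts/qwen2_5vl_visurf_nonobj_7300.sh` as follows; the concrete numbers are illustrative assumptions, not settings recommended by the authors:

```bash
# Example low-memory values (illustrative; tune for your hardware) inside the training script
worker.actor.micro_batch_size_per_device_for_update=1 \
worker.actor.micro_batch_size_per_device_for_experience=2 \
worker.rollout.tensor_parallel_size=2 \
worker.rollout.gpu_memory_utilization=0.6 \
worker.rollout.n=4 \
```
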
### 2. Merge Checkpoint in Hugging Face Format

```bash
python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
```

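The merged directory should then load like any local Hugging Face checkpoint, mirroring the basic usage snippet above; the path below is a placeholder:

```python
# Load the merged checkpoint from its local directory (placeholder path)
from transformers import AutoTokenizer, AutoModelForCausalLM

merged_dir = "path/to/merged_checkpoint"
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(merged_dir)
```
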
## Build Your Own Training Data (Optional)

Please refer to [Seg-Zero](https://github.com/dvlab-research/Seg-Zero) if you want to build your own dataset.

## Citation

```bibtex
@article{liu2025visurf,
  title   = {ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models},
  author  = {Liu, Yuqi and Chen, Liangyu and Liu, Jiazhen and Zhu, Mingkang and Zhong, Zhisheng and Yu, Bei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2510.10606},
  year    = {2025}
}

@article{liu2025segzero,
  title   = {Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement},
  author  = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2503.06520},
  year    = {2025}
}

@article{liu2025visionreasoner,
  title   = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},
  author  = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},
  journal = {arXiv preprint arXiv:2505.12081},
  year    = {2025}
}
```

## Acknowledgement

We would like to thank the following repos for their great work:

- This work is built upon [EasyR1](https://github.com/hiyouga/EasyR1) and [veRL](https://github.com/volcengine/verl).
- This work utilizes models from [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).

## Star History

[Star History Chart](https://star-history.com/#dvlab-research/ViSurf&Date)