nielsr (HF Staff) committed
Commit 6203dca · verified · 1 Parent(s): ed36321

Improve model card: Add pipeline tag, library, paper, code links and detailed usage


This PR significantly enhances the model card for the `Visurf-7B-Best-on-gRefCOCO` model by:

- Adding `library_name: transformers` to enable automated code snippets for the Hugging Face `transformers` library, as evidenced by the existing usage example and `config.json`.
- Adding `pipeline_tag: image-text-to-text` for better model discoverability on the Hugging Face Hub, reflecting its nature as a Large Vision-and-Language Model.
- Including a link to the paper: [ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models](https://huggingface.co/papers/2510.10606).
- Adding a link to the official GitHub repository for code and further resources: https://github.com/dvlab-research/ViSurf.
- Populating the model card with a comprehensive overview (including the abstract and diagram), detailed installation instructions, inference examples, evaluation, training guidelines, and other relevant information directly from the project's GitHub README. This provides rich, user-friendly documentation for the model.

Please review these additions and merge this PR.

Files changed (1)
  1. README.md +191 -1
README.md CHANGED
@@ -6,9 +6,30 @@ tags:
- multimodal
- qwen
- visurf
+ library_name: transformers
+ pipeline_tag: image-text-to-text
---

- # Visurf Model
+ # ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
+
+ This repository contains the model presented in the paper [**ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models**](https://huggingface.co/papers/2510.10606).
+
+ **GitHub Repository**: https://github.com/dvlab-research/ViSurf
+
+ ## Abstract
+ Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (**Vi**sual **Su**pervised-and-**R**einforcement **F**ine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, outperforming individual SFT, RLVR, and two-stage SFT → RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
+
+ ## Overview of ViSurf
+
+ <div align=center>
+ <img width="98%" src="https://github.com/dvlab-research/ViSurf/raw/main/assets/overview.png"/>
+ </div>
+
+ ViSurf (**Vi**sual **Su**pervised-and-**R**einforcement **F**ine-Tuning) is a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage.
+
+ ## Basic Usage with Transformers
+
+ This section demonstrates how to load the model using the Hugging Face `transformers` library.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -18,3 +39,172 @@ model_name = "Ricky06662/Visurf-7B-Best-on-gRefCOCO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
```
+
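+ For full vision-language generation (image plus question in, text out), the following is a minimal, illustrative sketch. It assumes the checkpoint follows the Qwen2.5-VL architecture (training starts from Qwen2.5-VL-7B-Instruct) and that your `transformers` version provides `Qwen2_5_VLForConditionalGeneration`; it does not reproduce the segmentation-specific prompting and mask decoding of the inference scripts below.
+ ```python
+ # Illustrative sketch only: assumes a Qwen2.5-VL-style checkpoint and a recent transformers release.
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+
+ model_name = "Ricky06662/Visurf-7B-Best-on-gRefCOCO"
+ processor = AutoProcessor.from_pretrained(model_name)
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_name, torch_dtype=torch.float16, device_map="auto"
+ )
+
+ image = Image.open("your_image.jpg")  # hypothetical local image path
+ conversation = [
+     {"role": "user", "content": [
+         {"type": "image"},
+         {"type": "text", "text": "I want to rest, where should I sit?"},
+     ]}
+ ]
+ prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
+
+ # Generate the model's textual reasoning/answer.
+ output_ids = model.generate(**inputs, max_new_tokens=256)
+ answer = processor.batch_decode(
+     output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(answer)
+ ```
+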
+ ## News
+
+ [Oct. 12th, 2025] 🔥 ViSurf is coming! We have released the code and training data.
+
+ ## Contents
+ - [Installation](#installation)
+ - [Inference](#inference)
+ - [Evaluation](#evaluation)
+ - [Training](#training)
+ - [Build Your Data](#build-your-own-training-data-optional)
+ - [Citation](#citation)
+ - [Acknowledgement](#acknowledgement)
+
+ ## Installation
+
+ ```bash
+ git clone https://github.com/dvlab-research/ViSurf.git
+ cd ViSurf
+ conda create -n visionreasoner python=3.12
+ conda activate visionreasoner
+ pip install -e .
+ ```
+
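+ As a quick sanity check of the environment (an optional, illustrative snippet; exact versions depend on what `pip install -e .` resolves):
+ ```python
+ # Optional sanity check: confirm the core dependencies import and a GPU is visible.
+ import torch
+ import transformers
+
+ print("torch:", torch.__version__)
+ print("transformers:", transformers.__version__)
+ print("CUDA available:", torch.cuda.is_available())
+ if torch.cuda.is_available():
+     print("GPU:", torch.cuda.get_device_name(0))
+ ```
+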
+ ## Inference
+ Download the pretrained model using the following script:
+ ```bash
+ mkdir pretrained_models
+ cd pretrained_models
+ git lfs install
+ git clone https://huggingface.co/Ricky06662/Visurf-7B-Best-on-gRefCOCO
+ ```
+
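+ Alternatively, the checkpoint can be fetched without `git lfs` via `huggingface_hub` (illustrative sketch; the local directory below is just an example):
+ ```python
+ # Download the checkpoint into pretrained_models/ without git-lfs.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="Ricky06662/Visurf-7B-Best-on-gRefCOCO",
+     local_dir="pretrained_models/Visurf-7B-Best-on-gRefCOCO",
+ )
+ ```
+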
+ > [!TIP]
+ > If you encounter issues with connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.
+
+ Then run inference using:
+ ```bash
+ python inference_scripts/inference_visurf.py
+ ```
+ The default question is:
+ > "I want to rest, where should I sit?"
+
+ You will see the thinking process in the command line, like:
+
+ > "The question seems to be asking where to sit, but the image only shows a kitchen counter with food and flowers."
+
+ The mask will be saved in the **inference_scripts** folder. In this case, there is no related object.
+
+ <div align=center>
+ <img width="98%" src="https://github.com/dvlab-research/ViSurf/raw/main/assets/test_output_1.png"/>
+ </div>
+
+ You can also try finding objects in the image by:
+ ```bash
+ python inference_scripts/inference_visurf.py --text "I want to cook food, what can I use?"
+ ```
+
+ You will see the thinking process in the command line, like:
+
+ > "The question asks what kitchen tools or ingredients are visible that could be used for cooking."
+
+ The mask will be saved in the **inference_scripts** folder.
+
+ <div align=center>
+ <img width="98%" src="https://github.com/dvlab-research/ViSurf/raw/main/assets/test_output_2.png"/>
+ </div>
+
+ You can also provide your own image path and text by:
+ ```bash
+ python inference_scripts/inference_visurf.py --image_path "your_image_path" --text "your question text"
+ ```
+
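+ To run the script over several image/question pairs, a small driver like this illustrative sketch can help (the image paths below are placeholders):
+ ```python
+ # Hypothetical batch driver: calls the inference script once per (image, question) pair.
+ import subprocess
+
+ cases = [
+     ("images/kitchen.jpg", "I want to cook food, what can I use?"),
+     ("images/livingroom.jpg", "I want to rest, where should I sit?"),
+ ]
+ for image_path, question in cases:
+     subprocess.run(
+         ["python", "inference_scripts/inference_visurf.py",
+          "--image_path", image_path, "--text", question],
+         check=True,
+     )
+ ```
+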
+ ## Evaluation
+
+ Evaluation Data: [🤗 gRefCOCO val](https://huggingface.co/datasets/Ricky06662/grefcoco_val_all)
+
+ We recommend using [VisionReasoner](https://github.com/dvlab-research/VisionReasoner) to evaluate ViSurf.
+
+ > [!NOTE]
+ > In ViSurf, the best results on different benchmarks are obtained with different checkpoints. We only release the best checkpoint on gRefCOCO. If you care about the exact numbers, we suggest evaluating and comparing the values in your own environment.
+
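+ If you only need a rough local check of predicted masks against ground truth, a plain per-sample IoU can be computed as in the sketch below; note that the official gRefCOCO protocol involves more than plain IoU (for example, its handling of no-target samples) and is implemented by the VisionReasoner evaluation code, not by this snippet:
+ ```python
+ # Minimal per-sample mask IoU; not the official gRefCOCO protocol.
+ import numpy as np
+
+ def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
+     """Both inputs are boolean HxW masks."""
+     pred, gt = pred.astype(bool), gt.astype(bool)
+     union = np.logical_or(pred, gt).sum()
+     if union == 0:
+         # No-target case: count as correct when the prediction is also empty.
+         return 1.0
+     return float(np.logical_and(pred, gt).sum() / union)
+
+ # Example with random masks.
+ rng = np.random.default_rng(0)
+ pred = rng.random((480, 640)) > 0.5
+ gt = rng.random((480, 640)) > 0.5
+ print(f"IoU: {mask_iou(pred, gt):.3f}")
+ ```
+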
+ ## Training
+
+ ### 1. ViSurf Training
+
+ Training Data: [🤗 ViSurf 7300](https://huggingface.co/datasets/Ricky06662/ViSurf_multi_non_object_7300_size840)
+
+ Download the dataset using this script:
+ ```bash
+ python training_scripts/download_dataset.py
+ ```
+
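+ If you prefer the `datasets` library over the download script, the training set can usually be loaded directly from the Hub (illustrative; whether this works as-is depends on how the dataset repository is structured):
+ ```python
+ # Pull the training data straight from the Hub (alternative to the download script).
+ from datasets import load_dataset
+
+ ds = load_dataset("Ricky06662/ViSurf_multi_non_object_7300_size840")
+ print(ds)  # inspect available splits and columns
+ ```
+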
+ > [!TIP]
+ > Try resizing the images and re-calculating the corresponding bbox/point coordinates if you have lower GPU memory, as sketched below. Remember to change the corresponding `resize_size` in evaluation and inference.
+
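+ A minimal sketch of that resize-and-rescale step (assuming square resizing and boxes given as `[x1, y1, x2, y2]` in pixels; point coordinates scale with the same factors, and the target size below is just an example):
+ ```python
+ # Resize an image and rescale a bbox accordingly (points scale with the same factors).
+ from PIL import Image
+
+ def resize_with_bbox(image: Image.Image, bbox, target_size=560):
+     """bbox is [x1, y1, x2, y2] in pixels of the original image."""
+     w, h = image.size
+     sx, sy = target_size / w, target_size / h
+     resized = image.resize((target_size, target_size))
+     x1, y1, x2, y2 = bbox
+     return resized, [x1 * sx, y1 * sy, x2 * sx, y2 * sy]
+
+ # Hypothetical example image and box.
+ img, box = resize_with_bbox(Image.open("your_image.jpg"), [100, 50, 400, 300], target_size=560)
+ ```
+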
+ Download the pretrained base model using the following script:
+ ```bash
+ mkdir pretrained_models
+ cd pretrained_models
+ git lfs install
+ git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
+ ```
+
+ (Optional) Start Ray in advance by:
+ ```bash
+ ray start --head  # or: ray start --head --port xxxx
+ ```
+
+ Start training using this script:
+ ```bash
+ bash training_scripts/qwen2_5vl_visurf_nonobj_7300.sh
+ ```
+
+ You can try changing the following hyper-parameters if you have large GPU memory:
+ ```bash
+ worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
+ worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \
+ ```
+ If your GPU has less memory, you can change the following config. The numbers depend on your GPU memory.
+ ```bash
+ worker.rollout.tensor_parallel_size=[your number between 1-4]
+ worker.rollout.gpu_memory_utilization=[your number between 0-1]
+ worker.rollout.n=[your number between 2-32]
+ ```
+
+ ### 2. Merge Checkpoint in Hugging Face Format
+
+ ```bash
+ python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
+ ```
+
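+ If you want to share the merged checkpoint, it can be pushed to the Hub with `huggingface_hub` (illustrative sketch; the repo id and local path are placeholders):
+ ```python
+ # Upload the merged (Hugging Face format) checkpoint to the Hub.
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ api.create_repo("your-username/your-visurf-checkpoint", repo_type="model", exist_ok=True)
+ api.upload_folder(
+     folder_path="path/to/merged_checkpoint",  # output of model_merger.py
+     repo_id="your-username/your-visurf-checkpoint",
+     repo_type="model",
+ )
+ ```
+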
+ ## Build Your Own Training Data (Optional)
+ Please refer to [Seg-Zero](https://github.com/dvlab-research/Seg-Zero) if you want to build your own dataset.
+
+ ## Citation
+
+ ```bibtex
+ @article{liu2025visurf,
+   title   = {ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models},
+   author  = {Liu, Yuqi and Chen, Liangyu and Liu, Jiazhen and Zhu, Mingkang and Zhong, Zhisheng and Yu, Bei and Jia, Jiaya},
+   journal = {arXiv preprint arXiv:2510.10606},
+   year    = {2025}
+ }
+
+ @article{liu2025segzero,
+   title   = {Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement},
+   author  = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
+   journal = {arXiv preprint arXiv:2503.06520},
+   year    = {2025}
+ }
+
+ @article{liu2025visionreasoner,
+   title   = {VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning},
+   author  = {Liu, Yuqi and Qu, Tianyuan and Zhong, Zhisheng and Peng, Bohao and Liu, Shu and Yu, Bei and Jia, Jiaya},
+   journal = {arXiv preprint arXiv:2505.12081},
+   year    = {2025}
+ }
+ ```
+
+ ## Acknowledgement
+ We would like to thank the following repos for their great work:
+
+ - This work is built upon [EasyR1](https://github.com/hiyouga/EasyR1) and [veRL](https://github.com/volcengine/verl).
+ - This work utilizes models from [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).
+
+ ## Star History
+
+ [![Star History Chart](https://api.star-history.com/svg?repos=dvlab-research/ViSurf&type=Date)](https://star-history.com/#dvlab-research/ViSurf&Date)