---
base_model:
- ByteDance-Seed/UI-TARS-2B-SFT
datasets:
- OS-Copilot/OS-Atlas-data
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

# GUI-Actor-Verifier-2B


This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://huggingface.co/papers/2506.03143).
It is built on [UI-TARS-2B-SFT](https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT) and is designed to predict whether a proposed action position is correct for a given language instruction. It is particularly well suited to **GUI-Actor**, whose attention map provides diverse candidate positions for verification from a single inference pass.


For more details on model design and evaluation, please check: [🏠 Project Page](https://microsoft.github.io/GUI-Actor/) | [💻 GitHub Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper](https://huggingface.co/papers/2506.03143).


| Model List                                  | Hugging Face Link                         |
|--------------------------------------------|--------------------------------------------|
| **GUI-Actor-7B-Qwen2-VL**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)         |
| **GUI-Actor-2B-Qwen2-VL**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL)         |
| **GUI-Actor-7B-Qwen2.5-VL**                 | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL)       |
| **GUI-Actor-3B-Qwen2.5-VL**                 | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)       |
| **GUI-Actor-Verifier-2B**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)        |



## 📊 Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
| Method           | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|------------------|--------------|----------------|------------|----------------|
| **_72B models:_** | | | | |
| AGUVIS-72B       | Qwen2-VL     | -              | 89.2       | -              |
| UGround-V1-72B   | Qwen2-VL     | 34.5           | **89.4**   | -              |
| UI-TARS-72B      | Qwen2-VL     | **38.1**       | 88.4       | **90.3**       |
| **_7B models:_** | | | | |
| OS-Atlas-7B      | Qwen2-VL     | 18.9           | 82.5       | 84.1           |
| AGUVIS-7B        | Qwen2-VL     | 22.9           | 84.4       | 86.0†          |
| UGround-V1-7B    | Qwen2-VL     | 31.1           | 86.3       | 87.6†          |
| UI-TARS-7B       | Qwen2-VL     | 35.7           | 89.5   | **91.6**       |
| GUI-Actor-7B     | Qwen2-VL     | 40.7       | 88.3       | 89.5           |
| GUI-Actor-7B + Verifier     | Qwen2-VL    | **44.2**       | **89.7**       | 90.9           |
| **_2B models:_** | | | | |
| UGround-V1-2B    | Qwen2-VL     | 26.6           | 77.1       | -              |
| UI-TARS-2B       | Qwen2-VL     | 27.7           | 82.3       | 84.7           |
| GUI-Actor-2B     | Qwen2-VL     | 36.7       | 86.5   | 88.6       |
| GUI-Actor-2B + Verifier     | Qwen2-VL    | **41.8**       | **86.9**       | **89.3**           |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.
| Method         | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|----------------|---------------|----------------|----------------|
| **_7B models:_** | | | |
| Qwen2.5-VL-7B  | Qwen2.5-VL    | 27.6           | 88.8           |
| Jedi-7B        | Qwen2.5-VL    | 39.5           | 91.7           |
| GUI-Actor-7B   | Qwen2.5-VL    | 44.6      | 92.1       |
| GUI-Actor-7B + Verifier   | Qwen2.5-VL    | **47.7**       | **92.5**       |
| **_3B models:_** | | | |
| Qwen2.5-VL-3B  | Qwen2.5-VL    | 25.9           | 80.9           |
| Jedi-3B        | Qwen2.5-VL    | 36.1           | 88.6           |
| GUI-Actor-3B   | Qwen2.5-VL    | 42.2       | 91.0       |
| GUI-Actor-3B + Verifier   | Qwen2.5-VL    | **45.9**       | **92.4**       |

## 🚀 Usage
The verifier takes as input a language instruction and an image with a red circle marking the target position, as in the example below. It outputs either 'True' or 'False', and you can also use the probability of each label to score the sample (a minimal scoring sketch follows the code example below).

For more detailed usage, please refer to our [GitHub repo](https://github.com/microsoft/GUI-Actor).

<img src="https://cdn-uploads.huggingface.co/production/uploads/64d45451c34a346181b130dd/1LTBORYJsO9Ru6B4q_SKl.png" alt="image" width="500"/>


```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from transformers.generation import GenerationConfig
import json
import re
import os
import numpy as np
from PIL import Image, ImageDraw
from qwen_vl_utils import process_vision_info



# load model
model_name_or_path = "microsoft/GUI-Actor-Verifier-2B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name_or_path, 
            device_map="cuda:0", 
            trust_remote_code=True, 
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2"
        ).eval()
output_len = 1

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name_or_path)

def draw_annotations(img, point_in_pixel, bbox, output_path='test.png', color='red', size=1):
    draw = ImageDraw.Draw(img)
    
    # Draw the ground-truth bounding box in yellow, if provided
    if bbox:
        # Assuming bbox format is [x1, y1, x2, y2]
        draw.rectangle(bbox, outline="yellow", width=4)
    
    # Draw a hollow circle around the predicted point (red by default)
    if point_in_pixel:
        # Bounding box of the circle; the radius scales with the image size
        radius = np.ceil(8 * size).astype(int)
        circle_bbox = [
            point_in_pixel[0] - radius,  # x1
            point_in_pixel[1] - radius,  # y1
            point_in_pixel[0] + radius,  # x2
            point_in_pixel[1] + radius   # y2
        ]
        draw.ellipse(circle_bbox, outline=color, width=np.ceil(4 * size).astype(int))
    
    return img

def ground_only_positive(model, tokenizer, processor, instruction, image, point):
  if isinstance(image, str):
      # Load from disk when a path is given; otherwise assume a PIL.Image was passed in.
      assert os.path.exists(image) and os.path.isfile(image), "Invalid input image path."
      image = Image.open(image)

  width, height = image.size
  image = draw_annotations(image, point, None, output_path=None, size=height/1000 * 1.2)

  # Verification prompt kept verbatim from the released example; rewording it may affect the verifier's behavior.
  prompt_origin = "Please observe the screenshot and exame whether the hollow red circle accurately placed on the intended position in the image: '{}'. Answer True or False."
  full_prompt = prompt_origin.format(instruction)

  messages = [
      {
          "role": "user",
          "content": [
              {
                  "type": "image",
                  "image": image,
              },
              {"type": "text", "text": full_prompt},
          ],
      }
  ]
  # Preparation for inference
  text_input = processor.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
  )
  image_inputs, video_inputs = process_vision_info(messages)
  inputs = processor(
      text=[text_input],
      images=image_inputs,
      videos=video_inputs,
      padding=True,
      return_tensors="pt",
  )
  inputs = inputs.to("cuda:0")

  generated_ids = model.generate(
      **inputs,
      max_new_tokens=output_len,
      do_sample=False,  # greedy decoding; the answer is a single True/False token
  )

  generated_ids_trimmed = [
      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]
  response = processor.batch_decode(
      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
  )[0]

  print(response)
  matches = re.findall(r'\b(?:True|False)\b', response)
  if not matches:
      answer = 'Error Format'
  else:
      answer = matches[-1]
  return answer

# Given an instruction, an image, and a candidate point (in pixels)
instruction = 'close this window'
image = Image.open('test.png')
width, height = image.size
point = [int(0.9709 * width), int(0.1548 * height)]  # the point must be in pixel coordinates
answer = ground_only_positive(model, tokenizer, processor, instruction, image, point)  # returns 'True' or 'False'
```
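
Instead of taking the greedy `True`/`False` answer, you can score a candidate by the probability the verifier assigns to each label at the first generated position, as mentioned above. The sketch below is a minimal illustration rather than code from the official repo: `score_candidate` is a hypothetical helper, and it assumes `inputs` has been prepared exactly as inside `ground_only_positive` (chat template, `process_vision_info`, processor call, moved to the model device).

```python
def score_candidate(inputs):
    # Token ids for the two labels; take the first sub-token in case a label is split.
    true_id = tokenizer.encode("True", add_special_tokens=False)[0]
    false_id = tokenizer.encode("False", add_special_tokens=False)[0]
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True,
        )
    logits = out.scores[0][0]  # vocabulary logits for the single generated token
    probs = torch.softmax(logits[[true_id, false_id]], dim=-1)
    return probs[0].item()  # close to 1.0 means the red circle is judged correct
```

This relative probability gives a finer-grained signal than the binary answer and can be used to rank multiple candidate positions for the same instruction.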

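The "+ Verifier" rows in the tables above pair a GUI-Actor grounding model with this verifier: GUI-Actor proposes several candidate points from its attention map in a single pass, and the verifier decides which one to act on. The sketch below shows one simple selection policy and is only an assumed wiring, not necessarily the exact procedure from the paper; `candidates` is assumed to be a list of pixel points already sorted by the grounding model's confidence, and the example coordinates are made up.

```python
def pick_candidate(instruction, image_path, candidates):
    # `candidates`: list of [x, y] pixel points, assumed sorted by grounding confidence.
    for point in candidates:
        image = Image.open(image_path)  # fresh copy so each circle is drawn on a clean image
        answer = ground_only_positive(model, tokenizer, processor, instruction, image, point)
        if answer == 'True':
            return point  # accept the first candidate the verifier approves
    return candidates[0]  # otherwise fall back to the grounding model's top prediction

best_point = pick_candidate('close this window', 'test.png', [[1880, 30], [1820, 30]])
```
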
## 📝 Citation
```
@article{wu2025gui,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
  journal={arXiv preprint arXiv:2506.03143},
  year={2025}
}
```