---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---
# GUI-Actor-7B with Qwen2-VL-7B as the backbone VLM
This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://huggingface.co/papers/2506.03143).
It is built on [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), augmented with an attention-based action head, and fine-tuned for GUI grounding on the [GUI-Actor-Data](https://huggingface.co/datasets/cckevinn/GUI-Actor-Data) dataset.
For more details on model design and evaluation, please see the [Project Page](https://microsoft.github.io/GUI-Actor/) | [GitHub Repo](https://github.com/microsoft/GUI-Actor) | [Paper](https://www.arxiv.org/pdf/2506.03143).
| Model Name | Hugging Face Link |
|--------------------------------------------|--------------------------------------------|
| **GUI-Actor-7B-Qwen2-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL) |
| **GUI-Actor-2B-Qwen2-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL) |
| **GUI-Actor-7B-Qwen2.5-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL) |
| **GUI-Actor-3B-Qwen2.5-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL) |
| **GUI-Actor-Verifier-2B** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B) |
## Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|------------------|--------------|----------------|------------|----------------|
| **_72B models:_** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | **89.4** | - |
| UI-TARS-72B | Qwen2-VL | **38.1** | 88.4 | **90.3** |
| **_7B models:_** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | **89.5** | **91.6** |
| GUI-Actor-7B | Qwen2-VL | **40.7** | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| **_2B models:_** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | **36.7** | **86.5** | **88.6** |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|----------------|---------------|----------------|----------------|
| **_7B models:_** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | **44.6** | **92.1** |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| **_3B models:_** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | **42.2** | **91.0** |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
## Usage
```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference
# load model
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",  # requires the flash-attn package; omit if unavailable
).eval()
# prepare example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Intruction: {example['instruction']}")
print(f"ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],  # PIL.Image.Image or str path
                # alternatively pass "image_url": "https://xxxxx.png", "file://xxxxx.png",
                # or "data:image/png;base64,xxxxxxxx" (data URIs are split on "base64,")
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]
# inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
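# topk_points holds the top-k (x, y) candidate click points (topk=3 above);
# the coordinates appear normalized to [0, 1], matching the ground-truth region format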
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
# >> Model Response
# Intruction: close this window
# ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
```
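To act on a prediction (e.g., with pyautogui, as the system prompt suggests), the normalized point needs to be mapped to screen pixels. Below is a minimal post-processing sketch, continuing from the snippet above; it assumes that both the predicted points and `example["bbox"]` are normalized to [0, 1], which the example output above suggests.
```python
# Reuses `example` and `pred` from the usage snippet above.
width, height = example["image"].size  # example["image"] is a PIL.Image.Image

# Map the top-1 normalized point to pixel coordinates (e.g., for a pyautogui click).
px, py = pred["topk_points"][0]
click_x, click_y = round(px * width), round(py * height)
print(f"Pixel click point: ({click_x}, {click_y})")

# Check whether the prediction falls inside the ground-truth action region.
x1, y1, x2, y2 = example["bbox"]
print(f"Inside ground-truth region: {x1 <= px <= x2 and y1 <= py <= y2}")

# Inspect all top-k candidates (topk=3 in the call above).
for rank, (x, y) in enumerate(pred["topk_points"], start=1):
    print(f"Candidate {rank}: [{round(x, 4)}, {round(y, 4)}]")
```
On a live system, `pyautogui.click(click_x, click_y)` would then execute the grounded action, assuming the screenshot resolution matches the screen.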
## π Citation
```bibtex
@article{wu2025gui,
title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
author={Wu, Qianhui and Cheng, Kanzhi and Yang, Rui and Zhang, Chaoyun and Yang, Jianwei and Jiang, Huiqiang and Mu, Jian and Peng, Baolin and Qiao, Bo and Tan, Reuben and others},
journal={arXiv preprint arXiv:2506.03143},
year={2025}
}
``` |