---
license: apache-2.0
datasets:
- OpenGVLab/ScaleCUA-Data
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
---
# ScaleCUA: Scaling Up Computer Use Agents with Cross-Platform Data
[\[GitHub\]](https://github.com/OpenGVLab/ScaleCUA) [\[Paper\]](https://github.com/OpenGVLab/ScaleCUA) [\[Quick Start\]](#model-loading)
## Introduction
Recent advances in Vision-Language Models have enabled agents that automate interactions with graphical user interfaces. Some computer use agents already demonstrate strong performance, but they are typically built on closed-source models or inaccessible proprietary datasets, and existing open-source datasets remain insufficient for developing cross-platform, general-purpose computer use agents. To bridge this gap, we scale up the computer use dataset, constructed via a novel dual-loop interactive pipeline that combines an automated agent with human experts for data collection. It spans **6 operating systems** and **3 task domains**, offering a large-scale and diverse corpus for training computer use agents.
Building on this corpus, we develop **ScaleCUA**, an agent capable of operating seamlessly across heterogeneous platforms. Trained on our dataset, it delivers consistent gains on several benchmarks, improving absolute success rates by **+26.6 points** on WebArena-Lite-v2 and **+10.7 points** on ScreenSpot-Pro over the baseline. Moreover, the ScaleCUA family achieves state-of-the-art performance across multiple benchmarks, e.g., **94.4%** on MMBench-GUI L1-Hard, **60.6%** on OSWorld-G, and **47.4%** on WebArena-Lite-v2. These results highlight the effectiveness of our data-centric methodology for scaling GUI understanding, grounding, and cross-platform task completion. We make our data, models, and code publicly available to facilitate future research: https://github.com/OpenGVLab/ScaleCUA.

---
## Model Loading
We provide example code to run `ScaleCUA` with `transformers`.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OpenGVLab/ScaleCUA-7B", torch_dtype="auto", device_map="auto"
)

min_pixels = 3136
max_pixels = 2109744
processor = AutoProcessor.from_pretrained("OpenGVLab/ScaleCUA-7B", min_pixels=min_pixels, max_pixels=max_pixels)
```
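The post-processing code in the sections below also needs the original screenshot resolution (`image_width`, `image_height`) to map predicted coordinates back to the raw image. A minimal sketch for obtaining it with Pillow, assuming the screenshot lives at the placeholder path reused in the `messages` examples below:
```python
from PIL import Image

# Placeholder path; reuse the same file you pass as the "image" entry in the messages below.
image_path = "/path/to/your/image"
with Image.open(image_path) as im:
    image_width, image_height = im.size  # original resolution of the screenshot
```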
## Direct Action Mode (as a grounder)
For tasks that require direct GUI grounding (e.g., identifying and clicking a specific button from a description), or to serve as the grounder in an agentic workflow, use the **Direct Action Mode**. This mode generates immediate, executable actions from the visual input.
1. To enable this mode, set the system prompt as follows:
```python
SCALECUA_SYSTEM_PROMPT_GROUNDER = '''You are an autonomous GUI agent capable of operating on desktops, mobile devices, and web browsers. Your primary function is to analyze screen captures and perform appropriate UI actions to complete assigned tasks.
## Action Space
def click(
    x: float | None = None,
    y: float | None = None,
    clicks: int = 1,
    button: str = "left",
) -> None:
    """Clicks on the screen at the specified coordinates. The `x` and `y` parameter specify where the mouse event occurs. If not provided, the current mouse position is used. The `clicks` parameter specifies how many times to click, and the `button` parameter specifies which mouse button to use ('left', 'right', or 'middle')."""
    pass
def doubleClick(
    x: float | None = None,
    y: float | None = None,
    button: str = "left",
) -> None:
    """Performs a double click. This is a wrapper function for click(x, y, 2, 'left')."""
    pass
def rightClick(x: float | None = None, y: float | None = None) -> None:
    """Performs a right mouse button click. This is a wrapper function for click(x, y, 1, 'right')."""
    pass
def moveTo(x: float, y: float) -> None:
    """Move the mouse to the specified coordinates."""
    pass
def dragTo(
    x: float | None = None, y: float | None = None, button: str = "left"
) -> None:
    """Performs a drag-to action with optional `x` and `y` coordinates and button."""
    pass
def swipe(
    from_coord: tuple[float, float] | None = None,
    to_coord: tuple[float, float] | None = None,
    direction: str = "up",
    amount: float = 0.5,
) -> None:
    """Performs a swipe action on the screen. The `from_coord` and `to_coord` specify the starting and ending coordinates of the swipe. If `to_coord` is not provided, the `direction` and `amount` parameters are used to determine the swipe direction and distance. The `direction` can be 'up', 'down', 'left', or 'right', and the `amount` specifies how far to swipe relative to the screen size (0 to 1)."""
    pass
def long_press(x: float, y: float, duration: int = 1) -> None:
    """Long press on the screen at the specified coordinates. The `duration` specifies how long to hold the press in seconds."""
    pass
## Input Specification
- Screenshot of the current screen + task description
## Output Format
<action>
[A set of executable action command]
</action>
## Note
- Avoid action(s) that would lead to invalid states.
- The generated action(s) must exist within the defined action space.
- The generated action(s) should be enclosed within <action></action> tags.'''
```
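For reference, a grounder response is a short action command wrapped in `<action>` tags; step 3 below interprets the coordinates as pixel positions in the resized input image and maps them back to the original resolution. The values here are purely illustrative:
```
<action>
click(x=1043, y=56)
</action>
```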
2. Use the above system prompt to generate a prediction:
```python
low_level_instruction = "Click the 'X' button in the upper right corner of the pop-up to close it and access the car selection options."
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SCALECUA_SYSTEM_PROMPT_GROUNDER,
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image",
            },
            {"type": "text", "text": low_level_instruction},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
3. Extract the predicted coordinates and map them back to the original image resolution:
```python
import re
from typing import Optional, Tuple

from qwen_vl_utils import smart_resize

def parse_scalecua_grounder_response(response: str, image_width: int, image_height: int,
                                      resized_width: int, resized_height: int) -> Optional[Tuple[int, int]]:
    """Extract (x, y) from the grounder response and rescale it to the original image resolution."""
    response = response.strip()
    match = re.search(r"\((\d+),\s*(\d+)\)", response)
    if not match:
        # Fall back to forms such as (x=512, y=384) or floating-point coordinates
        match = re.search(r'\((?:x=)?([-+]?\d*\.\d+|\d+)(?:,\s*(?:y=)?([-+]?\d*\.\d+|\d+))?\)', response)
    if not match or not match.group(2):
        return None
    x = int(float(match.group(1)) / resized_width * image_width)
    y = int(float(match.group(2)) / resized_height * image_height)
    return (x, y)

resize_h, resize_w = smart_resize(image_height, image_width, min_pixels=min_pixels, max_pixels=max_pixels)
x, y = parse_scalecua_grounder_response(output_text[0], image_width, image_height, resize_w, resize_h)
```
-----
## Reasoned Action Mode (as a native agent)
For complex, multi-step tasks, the **Reasoned Action Mode** guides the model to first think through the problem, state its intended operation, and then generate the corresponding action code. This is the recommended mode for general computer-use automation. Below we demonstrate an example of ScaleCUA on Ubuntu:
1. To enable this mode, use the following system prompt:
```python
SCALECUA_SYSTEM_PROMPT_AGENT = '''You are an autonomous GUI agent operating on the **Linux (Ubuntu)** platform. Your primary function is to analyze screen captures and perform appropriate UI actions to complete assigned tasks.
## Action Space
def click(
    x: float | None = None,
    y: float | None = None,
    clicks: int = 1,
    button: str = "left",
) -> None:
    """Clicks on the screen at the specified coordinates. The `x` and `y` parameter specify where the mouse event occurs. If not provided, the current mouse position is used. The `clicks` parameter specifies how many times to click, and the `button` parameter specifies which mouse button to use ('left', 'right', or 'middle')."""
    pass
def doubleClick(
    x: float | None = None,
    y: float | None = None,
    button: str = "left",
) -> None:
    """Performs a double click. This is a wrapper function for click(x, y, 2, 'left')."""
    pass
def rightClick(x: float | None = None, y: float | None = None) -> None:
    """Performs a right mouse button click. This is a wrapper function for click(x, y, 1, 'right')."""
    pass
def scroll(clicks: int, x: float | None = None, y: float | None = None) -> None:
    """Performs a scroll of the mouse scroll wheel at the specified coordinates. The `clicks` specifies how many clicks to scroll. The direction of the scroll (vertical or horizontal) depends on the underlying operating system. Normally, positive values scroll up, and negative values scroll down."""
    pass
def moveTo(x: float, y: float) -> None:
    """Move the mouse to the specified coordinates."""
    pass
def dragTo(
    x: float | None = None, y: float | None = None, button: str = "left"
) -> None:
    """Performs a drag-to action with optional `x` and `y` coordinates and button."""
    pass
def press(keys: str | list[str], presses: int = 1) -> None:
    """Performs a keyboard key press down, followed by a release. The function supports pressing a single key or a list of keys, multiple presses, and customizable intervals between presses."""
    pass
def hotkey(*args: str) -> None:
    """Performs key down presses on the arguments passed in order, then performs key releases in reverse order. This is used to simulate keyboard shortcuts (e.g., 'Ctrl-Shift-C')."""
    pass
def keyDown(key: str) -> None:
    """Performs a keyboard key press without the release. This will put that key in a held down state."""
    pass
def keyUp(key: str) -> None:
    """Performs a keyboard key release (without the press down beforehand)."""
    pass
def write(message: str) -> None:
    """Write the specified text."""
    pass
def call_user() -> None:
    """Call the user."""
    pass
def wait(seconds: int = 3) -> None:
    """Wait for the change to happen."""
    pass
def response(answer: str) -> None:
    """Answer a question or provide a response to an user query."""
    pass
def terminate(status: str = "success", info: str | None = None) -> None:
    """Terminate the current task with a status. The `status` specifies the termination status ('success', 'failure'), and the `info` can provide additional information about the termination."""
    pass
## Input Specification
- Screenshot of the current screen + task description + your past interaction history with UI to finish assigned tasks.
## Output Format
<think>
[Your reasoning process here]
</think>
<operation>
[Next intended operation description]
</operation>
<action>
[A set of executable action command]
</action>
## Note
- Avoid actions that would lead to invalid states.
- The generated action(s) must exist within the defined action space.
- The reasoning process, operation and action(s) in your response should be enclosed within <think></think>, <operation></operation> and <action></action> tags, respectively.'''
```
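For reference, an agent response follows the three-tag output format defined in the system prompt above. The content below is purely illustrative:
```
<think>
The Chrome menu is open. To check saved passwords, I should open Settings and then navigate to the password manager.
</think>
<operation>
Click on "Settings" in the drop-down menu to open the Chrome settings page.
</operation>
<action>
click(x=1211, y=724)
</action>
```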
2. Use the above system prompt to generate a prediction:
```python
SCALECUA_USER_PROMPT = '''Please generate the next move according to the UI screenshot, the task and previous operations.
Task:
{instruction}
Previous operations:
{history}
'''
def format_history(history):
    if len(history) > 0:
        actions_history = [f"Step {i+1}: {low_level}" for i, low_level in enumerate(history)]
        return "\n".join(actions_history)
    else:
        return None
history = ["Click on 'Chrome'", "Click on the three-dot menu icon in the top right corner of the Chrome window to open the browser settings menu."]
step_history = format_history(history)
task_instruction = "I want to check my password information in Chrome"
user_prompt = SCALECUA_USER_PROMPT.format(
instruction=task_instruction,
history=step_history,
)
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SCALECUA_SYSTEM_PROMPT_AGENT,
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image",
            },
            {"type": "text", "text": user_prompt},
        ],
    },
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
3. Extract the reasoning, low-level operation, and actions from the response:
```python
import ast
import re
from typing import Dict, List, Optional, Tuple

def parse_response(response: str) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Split the model response into its <think>, <operation> and <action> parts."""
    action_matches = re.findall(r'<action>\s*(.*?)\s*</action>', response, re.DOTALL)
    actions = []
    if action_matches:
        for match in action_matches:
            # Split each match by newline and strip whitespace from each line
            lines = [line.strip() for line in match.split('\n') if line.strip()]
            actions.extend(lines)
    operation_match = re.search(r'<operation>\s*(.*?)\s*</operation>', response, re.DOTALL)
    operation = operation_match.group(1).strip() if operation_match else None
    think_match = re.search(r'<think>\s*(.*?)\s*</think>', response, re.DOTALL)
    think = think_match.group(1).strip() if think_match else None
    return (think, operation, actions)

def parse_actions(actions: List[str]) -> Optional[List[Dict]]:
    """Parse action strings such as click(x=512, y=384) into {'name': ..., 'parameters': ...} dicts."""
    parsed_action = []
    for action in actions:
        match = re.match(r"(\w+)\((.*)\)", action)
        if not match:
            return None
        func_name = match.group(1)
        args_str = match.group(2)
        args = {}
        if 'hotkey' in func_name.lower():
            keys = re.findall(r"'(.*?)'", args_str)
            keys = [key.lower() for key in keys]
            args["args"] = keys
        elif 'press' in func_name.lower():
            keys = None
            presses = 1
            presses_match = re.search(r"presses\s*=\s*(\d+)", args_str)
            if presses_match:
                presses = int(presses_match.group(1))
                args_str = args_str[:presses_match.start()] + args_str[presses_match.end():]
                args_str = args_str.rstrip(", ").strip()
            keys_keyword_match = re.search(r"keys\s*=\s*(.*)", args_str, re.DOTALL)
            if keys_keyword_match:
                keys_str = keys_keyword_match.group(1).strip()
                if (keys_str.startswith("'") and keys_str.endswith("'")) or \
                   (keys_str.startswith('"') and keys_str.endswith('"')):
                    keys_str = keys_str[1:-1]
                elif keys_str.startswith("[") and keys_str.endswith("]"):
                    keys_str = ast.literal_eval(keys_str)
                keys = keys_str
            elif args_str:
                keys_str = args_str.strip()
                if (keys_str.startswith("'") and keys_str.endswith("'")) or \
                   (keys_str.startswith('"') and keys_str.endswith('"')):
                    keys_str = keys_str[1:-1]
                keys = keys_str
            args["keys"] = keys
            args["presses"] = presses
        elif 'scroll' in func_name.lower():
            clicks, x, y = None, None, None
            if '=' in args_str:
                kwargs = dict(re.findall(r'(\w+)\s*=\s*(-?\d+)', args_str))
                clicks = int(kwargs.get('clicks')) if kwargs.get('clicks') is not None else None
                x = int(kwargs.get('x')) if kwargs.get('x') is not None else None
                y = int(kwargs.get('y')) if kwargs.get('y') is not None else None
            elif args_str:
                try:
                    clicks = int(args_str)
                except ValueError:
                    pass
            if clicks: args['clicks'] = clicks
            if x: args['x'] = x
            if y: args['y'] = y
        else:
            if "=" in args_str:
                # Keyword arguments, including list-valued ones such as keys=['ctrl', 'c']
                for arg in re.finditer(r"(\w+)=\[([^\]]+)\]", args_str):
                    param = arg.group(1)
                    list_str = arg.group(2)
                    list_items = []
                    for item in re.finditer(r"'([^']*)'|\"([^\"]*)\"|([^,\]]+)", list_str):
                        val = (item.group(1) or item.group(2) or item.group(3)).strip()
                        if val:
                            list_items.append(val.strip('"\''))
                    args[param] = list_items
                for arg in re.finditer(r"(\w+)=([^,)]+)", args_str):
                    param = arg.group(1)
                    if param in args:
                        continue
                    value_str = arg.group(2).strip()
                    if value_str.isdigit():
                        value = int(value_str)
                    elif value_str.replace(".", "", 1).isdigit():
                        value = float(value_str)
                    elif value_str.lower() in ("true", "false"):
                        value = value_str.lower() == "true"
                    else:
                        value = value_str.strip('"\'')
                    args[param] = value
            else:
                # Positional arguments
                args_list = []
                for arg in re.finditer(r"'([^']*)'|\"([^\"]*)\"|([^,]+)", args_str):
                    val = (arg.group(1) or arg.group(2) or arg.group(3)).strip()
                    if val:
                        args_list.append(val.strip('"\''))
                if args_list:
                    args["args"] = args_list
        parsed_action.append({
            'name': func_name,
            'parameters': args
        })
    return parsed_action

think, operation, actions = parse_response(output_text[0])
structured_actions = parse_actions(actions)
```
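As a quick sanity check of the parser, a single click command is turned into a name/parameters dictionary (the coordinate values here are illustrative):
```python
example = parse_actions(["click(x=512, y=384)"])
print(example)  # [{'name': 'click', 'parameters': {'x': 512, 'y': 384}}]
```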
4. Transform coordinates based on the resized image:
```python
from qwen_vl_utils import smart_resize

resize_h, resize_w = smart_resize(image_height, image_width, min_pixels=min_pixels, max_pixels=max_pixels)
for action in structured_actions:
    params = action['parameters']
    if 'x' in params:
        # Rescale from the resized-image coordinate system back to the original resolution
        params['x'] = "{:.4f}".format(float(params['x']) / resize_w * image_width)
    if 'y' in params:
        params['y'] = "{:.4f}".format(float(params['y']) / resize_h * image_height)
```
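To actually execute the parsed actions on a desktop, you can dispatch them to an automation backend. Below is a minimal sketch using `pyautogui`; the mapping covers only a few action types and is our own assumption rather than the official ScaleCUA runtime:
```python
import pyautogui

def execute_action(action: dict) -> None:
    """Dispatch one parsed ScaleCUA action to pyautogui (sketch: click, doubleClick, write, hotkey only)."""
    name, params = action["name"], action["parameters"]
    if name == "click":
        pyautogui.click(x=round(float(params["x"])), y=round(float(params["y"])),
                        clicks=int(params.get("clicks", 1)), button=params.get("button", "left"))
    elif name == "doubleClick":
        pyautogui.doubleClick(x=round(float(params["x"])), y=round(float(params["y"])))
    elif name == "write":
        # parse_actions stores positional string arguments under "args"
        pyautogui.write(params.get("message") or params.get("args", [""])[0])
    elif name == "hotkey":
        pyautogui.hotkey(*params.get("args", []))
    # Control-flow actions (wait, response, terminate, call_user) should be handled by the agent loop instead.

for action in structured_actions:
    execute_action(action)
```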
-----
## Citation
If you find our project useful in your research, please consider citing:
```bibtex
@article{liu2025scalecua,
title = {ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data},
author = {Liu, Zhaoyang and Xie, Jingjing and Ding, Zichen and Li, Zehao and Yang, Bowen and Wu, Zhenyu and Wang, Xuehui and Sun, Qiushi and Liu, Shi and Wang, Weiyun and Ye, Shenglong and Li, Qingyun and Dong, Xuan and Yu, Yue and Lu, Chenyu and Mo, YunXiang and Yan, Yao and Tian, Zeyue and Zhang, Xiao and Huang, Yuan and Liu, Yiqian and Su, Weijie and Luo, Gen and Yue, Xiangyu and Qi, Biqing and Chen, Kai and Zhou, Bowen and Qiao, Yu and Chen, Qifeng and Wang, Wenhai},
year = {2025},
url = {https://github.com/OpenGVLab/ScaleCUA}
}
``` |