---
license: apache-2.0
datasets:
- OpenGVLab/ScaleCUA-Data
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
---

# ScaleCUA: Scaling Up Computer Use Agents with Cross-Platform Data

[\[📂 GitHub\]](https://github.com/OpenGVLab/ScaleCUA) [\[📜 Paper\]](https://github.com/OpenGVLab/ScaleCUA) [\[🚀 Quick Start\]](#model-loading)



## Introduction

Recent advances in Vision-Language Models (VLMs) have enabled agents that automate interactions with graphical user interfaces (GUIs). Some computer use agents demonstrate strong performance, but they are typically built on closed-source models or inaccessible proprietary datasets, and existing open-source datasets remain insufficient for developing cross-platform, general-purpose computer use agents. To bridge this gap, we scale up computer use data with a novel dual-loop interactive pipeline that combines an automated agent with human experts for data collection. The resulting corpus spans **6 operating systems** and **3 task domains**, offering a large-scale and diverse foundation for training computer use agents.
Building on this corpus, we develop **ScaleCUA**, which operates seamlessly across heterogeneous platforms. Trained on our dataset, it delivers consistent gains on several benchmarks, improving absolute success rates by **+26.6 points** on WebArena-Lite-v2 and **+10.7 points** on ScreenSpot-Pro over the baseline. Moreover, the ScaleCUA family achieves state-of-the-art performance across multiple benchmarks, e.g., **94.4%** on MMBench-GUI L1-Hard, **60.6%** on OSWorld-G, and **47.4%** on WebArena-Lite-v2. These results highlight the effectiveness of our data-centric methodology in scaling GUI understanding, grounding, and cross-platform task completion. We make our data, models, and code publicly available to facilitate future research: https://github.com/OpenGVLab/ScaleCUA.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6502f241b1792803da7e8def/YdK0I790ehLAKpR1vGkX1.png)

---

## Model Loading

We provide example code for running `ScaleCUA` with `transformers`.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OpenGVLab/ScaleCUA-7B", torch_dtype="auto", device_map="auto"
)

min_pixels = 3136
max_pixels = 2109744
processor = AutoProcessor.from_pretrained("OpenGVLab/ScaleCUA-7B", min_pixels=min_pixels, max_pixels=max_pixels)
```
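If FlashAttention-2 is installed, you can optionally load the model in `bfloat16` following the standard Qwen2.5-VL loading recipe. This is an optional variant; adjust it to your hardware:

```python
import torch

# Optional: bfloat16 weights + FlashAttention-2 typically reduce memory use.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OpenGVLab/ScaleCUA-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```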

## Direct Action Mode as a Grounder

For tasks that require direct GUI grounding (e.g., identifying and clicking a specific button from a description), or when the model serves as the grounder in an agentic workflow, use the **Direct Action Mode**. In this mode, the model generates immediate, executable actions directly from the visual input.

1. To enable this mode, set the system prompt as follows:
```python
SCALECUA_SYSTEM_PROMPT_GROUNDER = '''You are an autonomous GUI agent capable of operating on desktops, mobile devices, and web browsers. Your primary function is to analyze screen captures and perform appropriate UI actions to complete assigned tasks.

## Action Space
def click(
x: float | None = None,
y: float | None = None,
clicks: int = 1,
button: str = "left",
) -> None:
"""Clicks on the screen at the specified coordinates. The `x` and `y` parameter specify where the mouse event occurs. If not provided, the current mouse position is used. The `clicks` parameter specifies how many times to click, and the `button` parameter specifies which mouse button to use ('left', 'right', or 'middle')."""
pass

def doubleClick(
x: float | None = None,
y: float | None = None,
button: str = "left",
) -> None:
"""Performs a double click. This is a wrapper function for click(x, y, 2, 'left')."""
pass

def rightClick(x: float | None = None, y: float | None = None) -> None:
"""Performs a right mouse button click. This is a wrapper function for click(x, y, 1, 'right')."""
pass

def moveTo(x: float, y: float) -> None:
"""Move the mouse to the specified coordinates."""
pass

def dragTo(
x: float | None = None, y: float | None = None, button: str = "left"
) -> None:
"""Performs a drag-to action with optional `x` and `y` coordinates and button."""
pass

def swipe(
from_coord: tuple[float, float] | None = None,
to_coord: tuple[float, float] | None = None,
direction: str = "up",
amount: float = 0.5,
) -> None:
"""Performs a swipe action on the screen. The `from_coord` and `to_coord` specify the starting and ending coordinates of the swipe. If `to_coord` is not provided, the `direction` and `amount` parameters are used to determine the swipe direction and distance. The `direction` can be 'up', 'down', 'left', or 'right', and the `amount` specifies how far to swipe relative to the screen size (0 to 1)."""
pass

def long_press(x: float, y: float, duration: int = 1) -> None:
"""Long press on the screen at the specified coordinates. The `duration` specifies how long to hold the press in seconds."""
pass

## Input Specification
- Screenshot of the current screen + task description

## Output Format
<action>
[A set of executable action command]
</action>

## Note
- Avoid action(s) that would lead to invalid states.
- The generated action(s) must exist within the defined action space.
- The generated action(s) should be enclosed within <action></action> tags.'''
```
2. Use the system prompt above to generate a prediction:
```python
low_level_instruction = "Click the 'X' button in the upper right corner of the pop-up to close it and access the car selection options."

messages = [
    {
      "role": "system",
      "content":[
        {
          "type": "text",
          "text": SCALECUA_SYSTEM_PROMPT_GROUNDER,
        }
      ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image",
            },
            {"type": "text", "text": low_level_instruction},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
3. Extract the coordinates and map them from the resized image back to the original resolution:
```python
import re
from typing import Optional, Tuple

from PIL import Image
from qwen_vl_utils import smart_resize

def parse_scalecua_grounder_response(response: str, image_width: int, image_height: int, resized_width: int, resized_height: int) -> Optional[Tuple[int, int]]:
    """Extract the predicted (x, y) and rescale it from the resized image to the original resolution."""
    response = response.strip()
    match = re.search(r"\((\d+),\s*(\d+)\)", response)
    if not match:
        pattern = r'\((?:x=)?([-+]?\d*\.\d+|\d+)(?:,\s*(?:y=)?([-+]?\d*\.\d+|\d+))?\)'
        match = re.search(pattern, response)
    if not match:
        return None
    x = int(float(match.group(1)) / resized_width * image_width)
    y = int(float(match.group(2)) / resized_height * image_height) if match.group(2) else None
    if y is not None:
        return (x, y)
    return None


# Original screenshot resolution and the resolution actually seen by the model.
image_width, image_height = Image.open("/path/to/your/image").size
resize_h, resize_w = smart_resize(image_height, image_width, min_pixels=min_pixels, max_pixels=max_pixels)
x, y = parse_scalecua_grounder_response(output_text[0], image_width, image_height, resize_w, resize_h)
```
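The decoded response is expected to contain a single coordinate action wrapped in `<action>` tags, e.g. `<action>click(x=1916, y=12)</action>` (illustrative values, not a real transcript). Because the model predicts coordinates in the resized image space, the helper above rescales them, so the returned `(x, y)` is in the original screenshot's pixel frame and can be passed directly to your click backend.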

-----

## Reasoned Action Mode as a Native Agent

For complex, multi-step tasks, the **Reasoned Action Mode** guides the model to first reason about the problem, state its intended operation, and then generate the corresponding action code. This is the recommended mode for general computer-use automation. Below, we demonstrate ScaleCUA on Ubuntu:

1. To enable this mode, use the following system prompt:

```python
SCALECUA_SYSTEM_PROMPT_AGENT = '''You are an autonomous GUI agent operating on the **Linux (Ubuntu)** platform. Your primary function is to analyze screen captures and perform appropriate UI actions to complete assigned tasks.

## Action Space
def click(
    x: float | None = None,
    y: float | None = None,
    clicks: int = 1,
    button: str = "left",
) -> None:
    """Clicks on the screen at the specified coordinates. The `x` and `y` parameter specify where the mouse event occurs. If not provided, the current mouse position is used. The `clicks` parameter specifies how many times to click, and the `button` parameter specifies which mouse button to use ('left', 'right', or 'middle')."""
    pass


def doubleClick(
    x: float | None = None,
    y: float | None = None,
    button: str = "left",
) -> None:
    """Performs a double click. This is a wrapper function for click(x, y, 2, 'left')."""
    pass


def rightClick(x: float | None = None, y: float | None = None) -> None:
    """Performs a right mouse button click. This is a wrapper function for click(x, y, 1, 'right')."""
    pass


def scroll(clicks: int, x: float | None = None, y: float | None = None) -> None:
    """Performs a scroll of the mouse scroll wheel at the specified coordinates. The `clicks` specifies how many clicks to scroll. The direction of the scroll (vertical or horizontal) depends on the underlying operating system. Normally, positive values scroll up, and negative values scroll down."""
    pass


def moveTo(x: float, y: float) -> None:
    """Move the mouse to the specified coordinates."""
    pass


def dragTo(
    x: float | None = None, y: float | None = None, button: str = "left"
) -> None:
    """Performs a drag-to action with optional `x` and `y` coordinates and button."""
    pass


def press(keys: str | list[str], presses: int = 1) -> None:
    """Performs a keyboard key press down, followed by a release. The function supports pressing a single key or a list of keys, multiple presses, and customizable intervals between presses."""
    pass


def hotkey(*args: str) -> None:
    """Performs key down presses on the arguments passed in order, then performs key releases in reverse order. This is used to simulate keyboard shortcuts (e.g., 'Ctrl-Shift-C')."""
    pass


def keyDown(key: str) -> None:
    """Performs a keyboard key press without the release. This will put that key in a held down state."""
    pass


def keyUp(key: str) -> None:
    """Performs a keyboard key release (without the press down beforehand)."""
    pass


def write(message: str) -> None:
    """Write the specified text."""
    pass


def call_user() -> None:
    """Call the user."""
    pass


def wait(seconds: int = 3) -> None:
    """Wait for the change to happen."""
    pass


def response(answer: str) -> None:
    """Answer a question or provide a response to an user query."""
    pass


def terminate(status: str = "success", info: str | None = None) -> None:
    """Terminate the current task with a status. The `status` specifies the termination status ('success', 'failure'), and the `info` can provide additional information about the termination."""
    pass


## Input Specification
- Screenshot of the current screen + task description + your past interaction history with UI to finish assigned tasks.

## Output Format
<think>
[Your reasoning process here]
</think>
<operation>
[Next intended operation description]
</operation>
<action>
[A set of executable action command]
</action>

## Note
- Avoid actions that would lead to invalid states.
- The generated action(s) must exist within the defined action space.
- The reasoning process, operation and action(s) in your response should be enclosed within <think></think>, <operation></operation> and <action></action> tags, respectively.'''
```

2. Use the system prompt above to generate a prediction:
```python
SCALECUA_USER_PROMPT = '''Please generate the next move according to the UI screenshot, the task and previous operations.

Task:
{instruction}

Previous operations:
{history}
'''

def format_history(history):
    if len(history) > 0:
        actions_history = [f"Step {i+1}: {low_level}" for i, low_level in enumerate(history)]
        return "\n".join(actions_history) 
    else:
        return None

history = ["Click on 'Chrome'", "Click on the three-dot menu icon in the top right corner of the Chrome window to open the browser settings menu."]
step_history = format_history(history)

task_instruction = "I want to check my password information in Chrome"
user_prompt = SCALECUA_USER_PROMPT.format(
    instruction=task_instruction,
    history=step_history,
)


messages = [
    {
      "role": "system",
      "content":[
        {
          "type": "text",
          "text": SCALECUA_SYSTEM_PROMPT_AGENT,
        }
      ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/path/to/your/image",
            },
            {"type": "text", "text": user_prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

3. Extract the reasoning, the low-level operation, and the actions from the response:
```python
import ast
import re
from typing import Dict, List, Tuple

def parse_response(response: str) -> Tuple:
    action_matches = re.findall(r'<action>\s*(.*?)\s*</action>', response, re.DOTALL)
    actions = []
    if action_matches:
        for match in action_matches:
            # Split each match by newline and strip whitespace from each line
            lines = [line.strip() for line in match.split('\n') if line.strip()]
            actions.extend(lines)
    operation_match = re.search(r'<operation>\s*(.*?)\s*</operation>', response, re.DOTALL)
    operation = operation_match.group(1).strip() if operation_match else None

    think_match = re.search(r'<think>\s*(.*?)\s*</think>', response, re.DOTALL)
    think = think_match.group(1).strip() if think_match else None

    return (think, operation, actions)

def parse_actions(actions: List[str]) -> List[Dict]:
    parsed_action = []
    for action in actions:
        match = re.match(r"(\w+)\((.*)\)", action)
        if not match:
            return None

        func_name = match.group(1)
        args_str = match.group(2)
        args = {}

        if 'hotkey' in func_name.lower():
            keys = re.findall(r"'(.*?)'", args_str)
            keys = [key.lower() for key in keys]
            args["args"] = keys
        elif 'press' in func_name.lower():
            keys = None
            presses = 1  
            presses_match = re.search(r"presses\s*=\s*(\d+)", args_str)
            if presses_match:
                presses = int(presses_match.group(1))
                args_str = args_str[:presses_match.start()] + args_str[presses_match.end():]
                args_str = args_str.rstrip(", ").strip()

            keys_keyword_match = re.search(r"keys\s*=\s*(.*)", args_str, re.DOTALL)
            if keys_keyword_match:
                keys_str = keys_keyword_match.group(1).strip()
                if (keys_str.startswith("'") and keys_str.endswith("'")) or \
                (keys_str.startswith('"') and keys_str.endswith('"')):
                    keys_str = keys_str[1:-1]
                elif keys_str.startswith("[") and keys_str.endswith("]"):

                    keys_str = ast.literal_eval(keys_str)
                keys = keys_str
            elif args_str:
                keys_str = args_str.strip()
                if (keys_str.startswith("'") and keys_str.endswith("'")) or \
                (keys_str.startswith('"') and keys_str.endswith('"')):
                    keys_str = keys_str[1:-1]
                keys = keys_str

            args["keys"] = keys
            args["presses"] = presses
        elif 'scroll' in func_name.lower():
            clicks, x, y = None, None, None
            if '=' in args_str:
                kwargs = dict(re.findall(r'(\w+)\s*=\s*(-?\d+)', args_str))
                
                clicks = int(kwargs.get('clicks')) if kwargs.get('clicks') is not None else None
                x = int(kwargs.get('x')) if kwargs.get('x') is not None else None
                y = int(kwargs.get('y')) if kwargs.get('y') is not None else None
            
            elif args_str:
                try:
                    clicks = int(args_str)
                except ValueError:
                    pass

            if clicks: args['clicks'] = clicks
            if x: args['x'] = x
            if y: args['y'] = y

        else:
            if "=" in args_str:
                for arg in re.finditer(r"(\w+)=\[([^\]]+)\]", args_str):
                    param = arg.group(1)
                    list_str = arg.group(2)
                    
                    list_items = []
                    for item in re.finditer(r"'([^']*)'|\"([^\"]*)\"|([^,\]]+)", list_str):
                        val = (item.group(1) or item.group(2) or item.group(3)).strip()
                        if val:
                            list_items.append(val.strip('"\'')) 
                    
                    args[param] = list_items

                
                for arg in re.finditer(r"(\w+)=([^,)]+)", args_str):
                    param = arg.group(1)
                    if param in args:
                        continue
                    
                    value_str = arg.group(2).strip()
                    
                    if value_str.isdigit():
                        value = int(value_str)
                    elif value_str.replace(".", "", 1).isdigit():
                        value = float(value_str)
                    elif value_str.lower() in ("true", "false"):
                        value = value_str.lower() == "true"
                    else:
                        value = value_str.strip('"\'')
                    
                    args[param] = value

            
            else:
                args_list = []
                for arg in re.finditer(r"'([^']*)'|\"([^\"]*)\"|([^,]+)", args_str):
                    val = (arg.group(1) or arg.group(2) or arg.group(3)).strip()
                    if val:
                        args_list.append(val.strip('"\'')) 
                
                if args_list:
                    args["args"] = args_list

        parsed_action.append({
            'name': func_name,
            'parameters': args
        })

    return parsed_action


think, operation, actions = parse_response(output_text[0])
structured_actions = parse_actions(actions)
```

4. Map the predicted coordinates from the resized image back to the original screenshot resolution (an end-to-end sketch follows this list):
```python
from PIL import Image
from qwen_vl_utils import smart_resize

# Original screenshot resolution and the resolution actually seen by the model.
image_width, image_height = Image.open("/path/to/your/image").size
resize_h, resize_w = smart_resize(image_height, image_width, min_pixels=min_pixels, max_pixels=max_pixels)

for action in structured_actions:
    if 'x' in action['parameters']:
        x = float(action['parameters']['x']) / resize_w * image_width
        action['parameters']['x'] = round(x, 4)
    if 'y' in action['parameters']:
        y = float(action['parameters']['y']) / resize_h * image_height
        action['parameters']['y'] = round(y, 4)
```
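Putting the pieces together, the sketch below parses an illustrative (hypothetical) model response with the helpers above and dispatches the result through `pyautogui`, whose API the action space mirrors. This is a minimal sketch under stated assumptions (screenshot captured at native screen resolution, `pyautogui` as the execution backend); it is not the official ScaleCUA runner:

```python
import time

import pyautogui  # assumed execution backend; the action-space names mirror its API

# Hypothetical response in the Reasoned Action Mode output format (made-up content).
sample_response = (
    "<think>The settings menu is open; saved passwords live under 'Passwords and autofill'.</think>\n"
    "<operation>Click the 'Passwords and autofill' entry in the settings menu.</operation>\n"
    "<action>click(x=512, y=384)</action>"
)
think, operation, actions = parse_response(sample_response)
structured_actions = parse_actions(actions)
# structured_actions == [{"name": "click", "parameters": {"x": 512, "y": 384}}]
# (apply the step-4 coordinate mapping before executing on a real screen)


def execute_action(action: dict) -> None:
    """Dispatch one parsed action to pyautogui; illustrative only, not the official runner."""
    name = action["name"]
    params = dict(action["parameters"])
    positional = params.pop("args", [])  # parse_actions stores unnamed arguments under "args"
    if name in ("click", "doubleClick", "rightClick", "moveTo", "dragTo",
                "scroll", "press", "keyDown", "keyUp", "write"):
        getattr(pyautogui, name)(*positional, **params)
    elif name == "hotkey":
        pyautogui.hotkey(*positional)
    elif name == "wait":
        time.sleep(params.get("seconds", 3))
    else:  # call_user / response / terminate: surface to the caller instead of executing
        print(f"{name}: {params}")


for action in structured_actions:
    execute_action(action)
```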

-----

## Citation

If you find our project useful in your research, please consider citing:

```bibtex
@article{liu2025scalecua,
  title = {ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data},
  author = {Liu, Zhaoyang and Xie, Jingjing and Ding, Zichen and Li, Zehao and Yang, Bowen and Wu, Zhenyu and Wang, Xuehui and Sun, Qiushi and Liu, Shi and Wang, Weiyun and Ye, Shenglong and Li, Qingyun and Dong, Xuan and Yu, Yue and Lu, Chenyu and Mo, YunXiang and Yan, Yao and Tian, Zeyue and Zhang, Xiao and Huang, Yuan and Liu, Yiqian and Su, Weijie and Luo, Gen and Yue, Xiangyu and Qi, Biqing and Chen, Kai and Zhou, Bowen and Qiao, Yu and Chen, Qifeng and Wang, Wenhai},
  year = {2025},
  url = {https://github.com/OpenGVLab/ScaleCUA}
}
```