Instructions for using ServiceNow/GroundNext-7B-V0 with libraries and local apps.

Libraries

How to use ServiceNow/GroundNext-7B-V0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ServiceNow/GroundNext-7B-V0")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ServiceNow/GroundNext-7B-V0")
model = AutoModelForImageTextToText.from_pretrained("ServiceNow/GroundNext-7B-V0")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ServiceNow/GroundNext-7B-V0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ServiceNow/GroundNext-7B-V0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ServiceNow/GroundNext-7B-V0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/ServiceNow/GroundNext-7B-V0

SGLang

How to use ServiceNow/GroundNext-7B-V0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ServiceNow/GroundNext-7B-V0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ServiceNow/GroundNext-7B-V0",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ServiceNow/GroundNext-7B-V0" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use ServiceNow/GroundNext-7B-V0 with Docker Model Runner:
```
docker model run hf.co/ServiceNow/GroundNext-7B-V0
```

GroundNext-7B-V0

🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Model

Highlights

GroundNext-7B-V0 is a state-of-the-art vision-language model for GUI element grounding, developed as part of the GroundCUA project. This model features:

- Superior grounding accuracy: 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision
- Exceptional cross-platform generalization: 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training
- Data-efficient training: state-of-the-art results with only 700K training examples, versus 9M+ in prior work
- Strong agentic capabilities: 50.6% overall success rate on OSWorld for the GroundNext-3B variant paired with OpenAI o3 (7B results forthcoming)
- Native tool-calling support: built-in computer use action space for mouse, keyboard, and screen interactions

Model Overview

GroundNext-7B-V0 has the following characteristics:

Type: Vision-Language Model for GUI Grounding
Base Model: Qwen2.5-VL-7B-Instruct
Training Approach: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
Number of Parameters: 7.0B
Training Data: 700K human-annotated desktop demonstrations from GroundCUA dataset
Context Length: 262,144 tokens (inherited from base model)
Specialization: Desktop GUI element grounding with cross-platform generalization

For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our paper, GitHub repository, and project website.

Performance

Desktop Grounding Benchmarks

| Benchmark | Qwen2.5-VL-7B | UI-TARS-72B | GroundNext-7B-V0 |
|---|---|---|---|
| ScreenSpot-Pro | 29.7 | 38.1 | 52.9 |
| OSWorld-G | 42.7 | 57.1 | 67.7 |
| UI-Vision | 16.5 | 25.5 | 60.3 |
| Avg (Desktop) | 29.6 | 40.2 | 60.3 |

Cross-Platform Generalization (Desktop, Mobile & Web)

| Benchmark | Qwen2.5-VL-7B | UI-TARS-72B | GroundNext-7B-V0 |
|---|---|---|---|
| MMBench-GUI | 33.9 | 74.3 | 81.1 |
| ScreenSpot-v2 | 88.8 | 90.3 | 90.4 |
| Avg (Mobile/Web) | 61.4 | 82.3 | 85.8 |

Agentic Performance on OSWorld

When combined with OpenAI o3 for reasoning, GroundNext demonstrates strong end-to-end computer use capabilities. The table below reports the GroundNext-3B variant:

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |

Note: GroundNext-7B-V0 results with o3 integration forthcoming.

Quickstart

GroundNext-7B-V0 follows the Qwen2.5-VL implementation and is compatible with recent releases of the Hugging Face transformers library.

Qwen2.5-VL support was added in transformers 4.49.0, so older versions cannot load the model. We recommend using transformers>=4.49.0.

Installation

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "transformers>=4.49.0" torch torchvision accelerate
pip install qwen-vl-utils  # For image processing utilities

Basic Inference

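The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding. It relies on the groundcua helper module (prepare_image, create_messages, and the default generation constants), which is not installed by the commands above; see the GitHub repository for setup.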

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

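# Load model and processor
# Note: attn_implementation="flash_attention_2" requires the flash-attn package;
# remove that argument to fall back to the default attention implementation.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()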

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>

Deployment with vLLM

For production deployment, you can serve the model with vLLM, which exposes an OpenAI-compatible API endpoint:

vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192

Note: Adjust --max-model-len (or --context-length when serving with SGLang) based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
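
A client for the served endpoint might look like the minimal sketch below. It uses only the Python standard library, sends the screenshot as a base64 data URL, and applies the generation settings recommended under Best Practices. The helper names `build_payload` and `post_chat` are hypothetical, and `SYSTEM_PROMPT_TEMPLATE` is a placeholder rather than the model's actual system prompt, which should be taken from the GroundCUA code. The default port 8000 assumes an unmodified `vllm serve` invocation.

```python
import base64
import json
import urllib.request

# Placeholder only (assumption): substitute the real system prompt template
# from the GroundCUA repository, keeping the {width} and {height} fields.
SYSTEM_PROMPT_TEMPLATE = "Screen resolution: {width}x{height}."


def build_payload(png_bytes, instruction, width, height,
                  model="ServiceNow/GroundNext-7B-V0"):
    """Build an OpenAI-style chat request carrying one PNG screenshot."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "temperature": 0.0,   # deterministic grounding
        "max_tokens": 128,    # enough for a single tool call
        "messages": [
            {
                "role": "system",
                "content": SYSTEM_PROMPT_TEMPLATE.format(width=width, height=height),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                    {"type": "text", "text": instruction},
                ],
            },
        ],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat completions route and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `post_chat(build_payload(open("screenshot.png", "rb").read(), "Click on the 'File' button", 1920, 1080))` would return the raw model response, where `screenshot.png` and the dimensions are illustrative values.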

Best Practices

To achieve optimal grounding performance, we recommend:

Image Preprocessing:
- Use high-resolution screenshots (minimum 800x600)
- Ensure UI elements are clearly visible
- Maintain original aspect ratios when resizing
Prompt Engineering:
- Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save")
- Include element attributes when available (color, position, text)
Generation Parameters:
- Use temperature=0.0 for deterministic grounding
- Set max_new_tokens=128 (sufficient for tool calls)
- Enable use_cache=True for faster inference
System Prompt:
- Always include the system prompt with actual screen dimensions
- Replace {width} and {height} with true screenshot dimensions
- Maintain the tool signature format for proper JSON parsing
Post-processing:
- Parse <tool_call> tags to extract JSON
- Validate coordinates are within screen bounds
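
The post-processing steps above can be sketched as follows. This is a minimal example that assumes the response follows the tool-call format shown in the Basic Inference snippet and that the predicted coordinates are pixel positions in the image passed to the model. The function names `parse_click` and `to_original` are hypothetical and are not part of the groundcua package.

```python
import json
import re

# Captures whatever sits between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_click(response, width, height):
    """Return (x, y) from the first tool call, or None if missing or invalid."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
        x, y = call["arguments"]["coordinate"]
        x, y = float(x), float(y)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # Reject predictions that fall outside the screenshot.
    if not (0 <= x < width and 0 <= y < height):
        return None
    return x, y


def to_original(x, y, resized_size, original_size):
    """Map a point from the resized image back to the original screenshot."""
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    return x * original_w / resized_w, y * original_h / resized_h
```

Returning `None` instead of raising lets a calling agent retry or fall back when the model emits malformed output. `to_original` is only needed when the screenshot was resized before inference.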

Training

GroundNext-7B-V0 was trained using a two-stage approach:

Supervised Fine-tuning (SFT): Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
Reinforcement Learning (RLOO): Further optimized using reward-based learning with custom GUI grounding rewards
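
For readers unfamiliar with RLOO (REINFORCE Leave-One-Out), the sketch below shows the generic advantage estimate: several responses are sampled per prompt, and each response's reward is compared with the mean reward of the other samples. This card does not specify the reward function or the sampling setup, so `click_in_box_reward` is a hypothetical binary reward used only for illustration, not the reward used to train GroundNext.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def click_in_box_reward(point, box):
    """Hypothetical reward: 1.0 if the click lands inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


# Three sampled clicks for one instruction, scored against one target box.
target_box = (10, 10, 50, 30)
samples = [(20, 20), (200, 5), (60, 25)]
rewards = [click_in_box_reward(p, target_box) for p in samples]
advantages = rloo_advantages(rewards)
```

Samples that beat their peers receive a positive advantage and the rest a negative one, so no separate value network is needed as a baseline.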

For detailed training instructions, dataset preparation, and reproduction steps, please visit our GitHub repository.

Limitations and Future Work

Desktop-focused: Primarily trained on desktop environments (though it shows strong cross-platform generalization)
Action space: Currently supports only the mouse click action
Languages: Optimized for English UI elements
Resolution: Performance may vary with extremely high or low resolution images

Citation

If you use GroundNext-7B-V0 in your research, please cite:

@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}