GraspMolmo

[Paper] [arXiv] [Project Website] [Data]

GraspMolmo is a generalizable, open-vocabulary, task-oriented grasping (TOG) model for robotic manipulation. Given an image and a task to complete (e.g., "Pour me some tea"), GraspMolmo points to the most appropriate grasp location in the image, which can then be matched to the closest stable grasp candidate.

Code Sample

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

img = Image.open("<path_to_image>")
task = "Pour coffee from the blue mug."

processor = AutoProcessor.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)

prompt = f"Point to where I should grasp to accomplish the following task: {task}"
inputs = processor.process(images=img, text=prompt, return_tensors="pt")
# Move inputs to the model's device and add a batch dimension
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer)
# Decode only the newly generated tokens (the output also contains the prompt)
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)

Running the above code should produce output similar to the following:

In order to accomplish the task "Pour coffee from the blue mug.", the optimal grasp is described as follows: "The grasp is on the middle handle of the blue mug, with fingers grasping the sides of the handle.".

<point x="28.6" y="20.7" alt="Where to grasp the object">Where to grasp the object</point>
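The x and y attributes of the <point> tag locate the predicted grasp in the image. As a minimal sketch, assuming Molmo's convention of point coordinates expressed as percentages (0-100) of the image width and height, a hypothetical parse_point helper can convert the tag into pixel coordinates:

import re

def parse_point(text, img_width, img_height):
    # Extract the first <point x="..." y="..."> tag and scale the
    # percentage coordinates (0-100) to pixel coordinates.
    match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', text)
    if match is None:
        return None
    x_pct, y_pct = float(match.group(1)), float(match.group(2))
    return x_pct / 100.0 * img_width, y_pct / 100.0 * img_height

grasp_px = parse_point(generated_text, img.width, img.height)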

Grasp Inference

To predict a grasp point and match it to one of the candidate grasps, refer to the GraspMolmo class. First, install graspmolmo with

pip install "git+https://github.com/abhaybd/GraspMolmo.git#egg=graspmolmo[infer]"

and then inference can be run as follows:

import numpy as np

from graspmolmo.inference.grasp_predictor import GraspMolmo

task = "..."
# Capture an RGB-D frame and the 3x3 camera intrinsics matrix from your sensor
rgb, depth = get_image()
camera_intrinsics = np.array(...)

point_cloud = backproject(rgb, depth, camera_intrinsics)
# grasps are in the camera reference frame
grasps = predict_grasps(point_cloud)  # Using your favorite grasp predictor (e.g. M2T2)

gm = GraspMolmo()
idx = gm.pred_grasp(rgb, point_cloud, task, grasps)

print(f"Predicted grasp: {grasps[idx]}")