---
license: cc-by-nc-4.0
pipeline_tag: depth-estimation
---

# Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

This repository contains the Camera Depth Models (CDMs) presented in the paper *Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots*.

CDMs are proposed as simple plugins for daily-use depth cameras. They take RGB images and raw depth signals as input and output denoised, accurate metric depth. This approach addresses challenges in using depth cameras for robotic manipulation, such as limited accuracy and noise susceptibility, effectively bridging the sim-to-real gap for manipulation tasks.
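At a high level, a CDM exposes a simple interface: an RGB frame and the sensor's raw depth map go in, and a denoised metric depth map comes out. The sketch below illustrates only this interface; the `CameraDepthModel` class is a placeholder mock, not the repository's actual API (see Sample Usage below for the real entry point).

```python
# Illustrative interface only: `CameraDepthModel` is a placeholder mock,
# not the actual class exported by the manip-as-in-sim-suite repository.
import numpy as np

class CameraDepthModel:
    """Mock of the CDM interface: RGB + raw depth in, denoised metric depth out."""

    def __call__(self, rgb: np.ndarray, raw_depth: np.ndarray) -> np.ndarray:
        # A real CDM would run its network here; this mock just echoes the raw depth.
        return raw_depth.copy()

cdm = CameraDepthModel()
rgb = np.zeros((480, 640, 3), dtype=np.uint8)       # RGB frame from the depth camera
raw_depth = np.ones((480, 640), dtype=np.float32)   # noisy raw metric depth (meters)
clean_depth = cdm(rgb, raw_depth)                    # denoised, accurate metric depth (meters)
```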

## Abstract

Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties (such as distance, size, and shape) than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.

## Project Page

For more details and resources, visit the project page: https://manipulation-as-in-simulation.github.io

## Code

The full code, additional details, and further instructions can be found in the official GitHub repository: https://github.com/ByteDance-Seed/manip-as-in-sim-suite

For specific model inference instructions, refer to the CDM inference guide on GitHub.

## Sample Usage

To run depth inference on RGB-D camera data using CDM, follow this example from the repository:

```bash
cd cdm
python infer.py \
    --encoder vitl \
    --model-path /path/to/model.pth \
    --rgb-image /path/to/rgb.jpg \
    --depth-image /path/to/depth.png \
    --output result.png
```
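
The predicted depth map can then feed downstream geometry-based pipelines. As one hedged example, assuming the saved output is a 16-bit PNG storing depth in millimeters (check the repository's inference guide for the actual output format) and using example pinhole intrinsics that you should replace with your camera's calibration, the result could be unprojected into a point cloud:

```python
# Hypothetical post-processing: unproject a predicted depth map to a point cloud.
# Assumes `result.png` is a 16-bit PNG storing depth in millimeters and that the
# camera intrinsics (fx, fy, cx, cy) are known -- adjust both to your setup.
import cv2
import numpy as np

depth = cv2.imread("result.png", cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0  # meters
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0  # example intrinsics for a 640x480 sensor

# Build a pixel grid and back-project each pixel with the pinhole camera model.
v, u = np.indices(depth.shape)
x = (u - cx) * depth / fx
y = (v - cy) * depth / fy
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
points = points[depth.reshape(-1) > 0]  # drop invalid (zero-depth) pixels
print(f"Point cloud with {points.shape[0]} points")
```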