Abstract
A new benchmark and dataset are introduced for evaluating fine-grained spatial editing capabilities, along with a model that demonstrates superior performance on spatial manipulation tasks.
Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models fall short on fine-grained spatial manipulation, motivating a dedicated evaluation suite. Our contributions are threefold: (i) We introduce SpatialEdit-Bench, a comprehensive benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.
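To give a feel for how a controllable rendering pipeline can produce paired images with exact ground-truth camera transformations, here is a minimal sketch using Blender's Python API (`bpy`). This is an illustrative assumption about what such a pipeline might look like, not the authors' actual code: the circular trajectory, parameter values, and output layout are all hypothetical.

```python
# Minimal sketch (assumption; not the SpatialEdit-500k pipeline itself).
# Renders a scene from a circular camera trajectory and records the exact
# camera-to-world matrix per view, yielding image/pose pairs as ground truth.
# Run inside Blender; '//' paths require the .blend file to be saved.
import json
import math
import bpy
import mathutils

scene = bpy.context.scene
cam = scene.camera                      # assumes the scene has an active camera
radius, height, n_views = 4.0, 1.5, 24  # hypothetical trajectory parameters

poses = {}
for i in range(n_views):
    theta = 2.0 * math.pi * i / n_views
    cam.location = (radius * math.cos(theta), radius * math.sin(theta), height)
    # Point the camera at the world origin (simple look-at rotation).
    direction = mathutils.Vector((0.0, 0.0, 0.0)) - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
    bpy.context.view_layer.update()     # refresh matrix_world after moving the camera
    scene.render.filepath = f"//renders/view_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
    # Record the exact camera-to-world transform as ground-truth supervision.
    poses[f"view_{i:03d}"] = [list(row) for row in cam.matrix_world]

with open(bpy.path.abspath("//renders/poses.json"), "w") as f:
    json.dump(poses, f, indent=2)
```

Because the trajectory is scripted rather than estimated, every rendered pair comes with a noise-free relative transformation, which is what makes geometric-fidelity metrics like viewpoint reconstruction well-posed on this data.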
Community
the part that sticks with me is how SpatialEdit-Bench ties perceptual plausibility to explicit geometry via ground-truth 3d pose reconstructions and framing analyses. i worry the framing and viewpoint metrics could be brittle if the underlying pose estimates are off, which is likely in real scenes with occlusion and calibration noise. a targeted ablation where you inject pose perturbations during evaluation would help disentangle edits that truly respect geometry from those that survive only under perfect supervision. a quick sanity check on transfer to real images with more varied lighting and backgrounds would also strengthen the claim that the method generalizes beyond blender-style data. btw the arxivlens breakdown helped me parse the method details, it’s a solid walkthrough that covers this paper well: https://arxivlens.com/PaperView/Details/spatialedit-benchmarking-fine-grained-image-spatial-editing-8722-6cbbe8c6
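To make the suggested pose-perturbation ablation concrete, here is a minimal NumPy/SciPy sketch of injecting controlled pose noise at evaluation time. The metric callable, noise scales, and data layout are hypothetical stand-ins; this is not part of the benchmark's actual API.

```python
# Sketch of the pose-perturbation probe proposed above.
# Assumptions: poses are 4x4 camera-to-world matrices, and `metric` is a
# hypothetical stand-in for the benchmark's viewpoint/framing score.
import numpy as np
from scipy.spatial.transform import Rotation


def perturb_pose(T, rot_deg, trans, rng):
    """Left-multiply T by a small random rotation (deg) and translation."""
    noise = np.eye(4)
    axis = rng.normal(size=3)
    axis *= np.deg2rad(rot_deg) / (np.linalg.norm(axis) + 1e-12)
    noise[:3, :3] = Rotation.from_rotvec(axis).as_matrix()
    noise[:3, 3] = rng.normal(scale=trans, size=3)
    return noise @ T


def robustness_curve(edits, gt_poses, metric, levels=(0.0, 1.0, 2.0, 5.0)):
    """Re-score edits under increasing pose noise. A method that truly
    respects geometry should degrade gracefully rather than collapse."""
    rng = np.random.default_rng(0)
    scores = {}
    for deg in levels:
        noisy = [perturb_pose(T, rot_deg=deg, trans=0.01 * deg, rng=rng)
                 for T in gt_poses]
        scores[deg] = float(np.mean([metric(e, T) for e, T in zip(edits, noisy)]))
    return scores
```

Plotting the resulting score-vs-noise curve would separate edits that genuinely encode geometry from those whose scores only hold up under perfect pose supervision.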
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes (2026)
- VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition (2026)
- Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation (2026)
- Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence (2026)
- AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing (2026)
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation (2026)
- TexEditor: Structure-Preserving Text-Driven Texture Editing (2026)