Abstract
A new benchmark and dataset are introduced for evaluating fine-grained spatial editing capabilities, along with a model that demonstrates superior performance on spatial manipulation tasks.
Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models fall short on fine-grained spatial manipulation, motivating a dedicated evaluation suite. Our contributions are threefold: (i) We introduce SpatialEdit-Bench, a comprehensive benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.
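To give a feel for how a controllable rendering pipeline can produce paired images with exact ground-truth camera transformations, here is a minimal sketch using Blender's Python API (`bpy`). This is an illustrative assumption about what such a pipeline might look like, not the authors' actual code: the circular trajectory, parameter values, and output layout are all hypothetical.

```python
# Minimal sketch (assumption; not the SpatialEdit-500k pipeline itself).
# Renders a scene from a circular camera trajectory and records the exact
# camera-to-world matrix per view, yielding image/pose pairs as ground truth.
# Run inside Blender; '//' paths require the .blend file to be saved.
import json
import math
import bpy
import mathutils

scene = bpy.context.scene
cam = scene.camera                      # assumes the scene has an active camera
radius, height, n_views = 4.0, 1.5, 24  # hypothetical trajectory parameters

poses = {}
for i in range(n_views):
    theta = 2.0 * math.pi * i / n_views
    cam.location = (radius * math.cos(theta), radius * math.sin(theta), height)
    # Point the camera at the world origin (simple look-at rotation).
    direction = mathutils.Vector((0.0, 0.0, 0.0)) - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
    bpy.context.view_layer.update()     # refresh matrix_world after moving the camera
    scene.render.filepath = f"//renders/view_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
    # Record the exact camera-to-world transform as ground-truth supervision.
    poses[f"view_{i:03d}"] = [list(row) for row in cam.matrix_world]

with open(bpy.path.abspath("//renders/poses.json"), "w") as f:
    json.dump(poses, f, indent=2)
```

Because the trajectory is scripted rather than estimated, every rendered pair comes with a noise-free relative transformation, which is what makes geometric-fidelity metrics like viewpoint reconstruction well-posed on this data.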
Community
the part that sticks with me is how SpatialEdit-Bench ties perceptual plausibility to explicit geometry via ground-truth 3d pose reconstructions and framing analyses. i worry the framing and viewpoint metrics could be brittle if the underlying pose estimates are off, which is likely in real scenes with occlusion and calibration noise. a targeted ablation where you inject pose perturbations during evaluation would help disentangle edits that truly respect geometry from those that survive only under perfect supervision. a quick sanity check on transfer to real images with more varied lighting and backgrounds would also strengthen the claim that the method generalizes beyond blender-style data. btw the arxivlens breakdown helped me parse the method details, it’s a solid walkthrough that covers this paper well: https://arxivlens.com/PaperView/Details/spatialedit-benchmarking-fine-grained-image-spatial-editing-8722-6cbbe8c6
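To make the suggested pose-perturbation ablation concrete, here is a minimal NumPy/SciPy sketch of injecting controlled pose noise at evaluation time. The metric callable, noise scales, and data layout are hypothetical stand-ins; this is not part of the benchmark's actual API.

```python
# Sketch of the pose-perturbation probe proposed above.
# Assumptions: poses are 4x4 camera-to-world matrices, and `metric` is a
# hypothetical stand-in for the benchmark's viewpoint/framing score.
import numpy as np
from scipy.spatial.transform import Rotation


def perturb_pose(T, rot_deg, trans, rng):
    """Left-multiply T by a small random rotation (deg) and translation."""
    noise = np.eye(4)
    axis = rng.normal(size=3)
    axis *= np.deg2rad(rot_deg) / (np.linalg.norm(axis) + 1e-12)
    noise[:3, :3] = Rotation.from_rotvec(axis).as_matrix()
    noise[:3, 3] = rng.normal(scale=trans, size=3)
    return noise @ T


def robustness_curve(edits, gt_poses, metric, levels=(0.0, 1.0, 2.0, 5.0)):
    """Re-score edits under increasing pose noise. A method that truly
    respects geometry should degrade gracefully rather than collapse."""
    rng = np.random.default_rng(0)
    scores = {}
    for deg in levels:
        noisy = [perturb_pose(T, rot_deg=deg, trans=0.01 * deg, rng=rng)
                 for T in gt_poses]
        scores[deg] = float(np.mean([metric(e, T) for e, T in zip(edits, noisy)]))
    return scores
```

Plotting the resulting score-vs-noise curve would separate edits that genuinely encode geometry from those whose scores only hold up under perfect pose supervision.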
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes (2026)
- VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition (2026)
- Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation (2026)
- Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence (2026)
- AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing (2026)
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation (2026)
- TexEditor: Structure-Preserving Text-Driven Texture Editing (2026)