arxiv:2603.23478

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Published on Mar 24 · Submitted by Jiaying Lin on Mar 26
Authors:

Abstract

UniFunc3D enables 3D scene functionality segmentation by treating multimodal large language models as active observers that perform joint semantic, temporal, and spatial reasoning through adaptive frame selection.

AI-generated summary

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
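
As a rough illustration of the coarse-to-fine active grounding described in the abstract, the sketch below shows how an MLLM-as-observer loop might first rank whole frames for instruction relevance (temporal selection) and then zoom into candidate regions of the selected frames (spatial focusing). All names here (score_frame, propose_crops, score_crop) and the two-stage structure are illustrative assumptions, not the released UniFunc3D implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Candidate:
    frame_idx: int
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) crop in pixel coordinates
    score: float

def coarse_to_fine_select(
    frames: Sequence[object],                        # RGB frames of the scene video
    instruction: str,                                # implicit task, e.g. "turn on the lamp"
    score_frame: Callable[[object, str], float],     # hypothetical MLLM relevance score for a full frame
    propose_crops: Callable[[object], List[Tuple[int, int, int, int]]],  # hypothetical candidate boxes
    score_crop: Callable[[object, Tuple[int, int, int, int], str], float],
    top_k: int = 4,
) -> List[Candidate]:
    """Two-stage active selection: keep the k most relevant frames (temporal),
    then score zoomed-in regions of those frames (spatial)."""
    # Coarse pass: rank every frame by how relevant it looks for the instruction.
    coarse = sorted(
        range(len(frames)),
        key=lambda i: score_frame(frames[i], instruction),
        reverse=True,
    )[:top_k]

    # Fine pass: within the selected frames, score high-detail crops so the
    # model can focus on the interactive part, while the frame it came from
    # still provides the global context needed for disambiguation.
    candidates: List[Candidate] = []
    for i in coarse:
        for box in propose_crops(frames[i]):
            candidates.append(
                Candidate(frame_idx=i, box=box,
                          score=score_crop(frames[i], box, instruction))
            )
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```

The returned candidates pair each high-scoring crop with its source frame index, so a downstream segmentation step could lift the selected 2D evidence into 3D masks; how that lifting is done is specific to the paper's pipeline and not shown here.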

Community

Paper author Paper submitter

UniFunc3D is a training-free framework that enables AI agents to accurately segment interactive 3D object parts from natural language instructions. By treating a Multimodal Large Language Model as an "active observer," it uses a coarse-to-fine strategy to adaptively zoom in on relevant details while maintaining global scene context. This approach significantly outperforms previous baselines, including training-based methods, achieving a relative 59.9% mIoU improvement on the SceneFun3D benchmark.
