arxiv:2510.23607

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Published on Oct 27
· Submitted by Xiaoyang Wu on Oct 28
#1 Paper of the day

Abstract

Concerto, a minimalist model combining 3D self-distillation and 2D-3D joint embedding, achieves superior spatial feature learning and outperforms existing models in scene understanding and open-world perception.

AI-generated summary

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto gives rise to emergent spatial representations with superior fine-grained geometric and semantic consistency.
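
For intuition, here is a minimal PyTorch sketch of the two objectives named in the abstract: a DINO-style intra-modal self-distillation loss between student and EMA-teacher views of a point cloud, plus a cross-modal term that aligns each point's feature with the 2D feature at its projected pixel. Everything here (`student_3d`, `teacher_3d`, `encoder_2d`, the `point_to_pixel` correspondence, the temperatures, and the weight `lam`) is a labeled assumption, not the authors' implementation.

```python
# Minimal sketch of Concerto's two training signals (assumptions, not the
# authors' code): (1) 3D intra-modal self-distillation between a student and
# an EMA teacher, (2) 2D-3D joint embedding pulling each point feature toward
# the pixel feature at its projected image location.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_feat, teacher_feat, temp_s=0.1, temp_t=0.05):
    """DINO-style cross-entropy between student logits and detached teacher logits."""
    t = F.softmax(teacher_feat.detach() / temp_t, dim=-1)
    log_s = F.log_softmax(student_feat / temp_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

def joint_embedding_loss(point_feat, pixel_feat):
    """Cosine alignment between point features and their matched 2D features."""
    return 1.0 - F.cosine_similarity(point_feat, pixel_feat.detach(), dim=-1).mean()

def training_step(student_3d, teacher_3d, encoder_2d, points_v1, points_v2,
                  images, point_to_pixel, lam=1.0):
    """Hypothetical combined step; all modules and the point-to-pixel
    correspondence (from camera projection) are placeholders."""
    s = student_3d(points_v1)              # (N, D) per-point student features
    with torch.no_grad():
        t = teacher_3d(points_v2)          # (N, D) EMA-teacher features
        pix = encoder_2d(images)           # (H*W, D) flattened per-pixel features
    pix_for_points = pix[point_to_pixel]   # (N, D) one pixel feature per point
    return self_distillation_loss(s, t) + lam * joint_embedding_loss(s, pix_for_points)
```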

Community

Paper author Paper submitter
•
edited 3 days ago

TL;DR: Concerto provides a jointly 2D-3D self-supervised pre-trained Point Transformer V3 for 3D point cloud downstream tasks, modified from Sonata. A hypothetical open-world usage sketch follows the links below.

Homepage: https://pointcept.github.io/Concerto/
Gradio Demo: https://huggingface.co/spaces/Pointcept/Concerto
Inference Code: https://github.com/Pointcept/Concerto
Training Code: https://github.com/Pointcept/Pointcept
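
To make the "translator into CLIP's language space" from the abstract concrete, below is a hedged, self-contained sketch of open-vocabulary point labeling: a linear layer maps point features into CLIP's embedding space, and each point takes the label of its most similar text prompt. Only the OpenAI `clip` calls are real API; `point_feat` (here random stand-ins for Concerto features), the feature dimension, and `translator` are placeholders. See the inference repo above for the actual interface.

```python
# Hedged sketch of open-world perception via a linear CLIP translator.
# The Concerto features and the translator below are random stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/16", device=device)

prompts = ["a chair", "a table", "a wall", "a floor"]
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = F.normalize(text_feat.float(), dim=-1)       # (4, 512)

# Stand-ins: in practice point_feat comes from the pre-trained Concerto
# backbone and translator is the linear projection trained by the authors.
num_points, feat_dim = 10_000, 512                           # hypothetical dims
point_feat = torch.randn(num_points, feat_dim, device=device)
translator = nn.Linear(feat_dim, text_feat.shape[-1]).to(device)

with torch.no_grad():
    z = F.normalize(translator(point_feat), dim=-1)          # (N, 512)
    labels = (z @ text_feat.T).argmax(dim=-1)                # per-point prompt index
print(labels.shape)  # torch.Size([10000])
```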


Thanks for the great work! 👍 👍 👍

Very cool paper. With world models, splatting, and really anything that has to operate in 3D space, I feel like there is a lot of information we can distill from 2D models that learn some sort of depth information, as evidenced by DINO and other SSL models already being decent depth estimators as is.
Here is a bite-sized podcast about the work in case anyone wants to listen to an AI overview: https://spotifycreators-web.app.link/e/qfVxqCBJRXb

·
Paper author

Thank you


Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 1

Collections including this paper 8