arxiv:2510.23607

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Published on Oct 27
· Submitted by Xiaoyang Wu on Oct 28
#1 Paper of the day

Abstract

Concerto, a minimalist model combining 3D self-distillation and 2D-3D joint embedding, achieves superior spatial feature learning and outperforms existing models in scene understanding and open-world perception.

AI-generated summary

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto gives rise to emergent spatial representations with superior fine-grained geometric and semantic consistency.
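
For intuition, here is a minimal PyTorch sketch of the two objectives named in the abstract: a DINO-style intra-modal self-distillation loss between student and EMA-teacher views of a point cloud, plus a cross-modal term that aligns each point's feature with the 2D feature at its projected pixel. Everything here (`student_3d`, `teacher_3d`, `encoder_2d`, the `point_to_pixel` correspondence, the temperatures, and the weight `lam`) is a labeled assumption, not the authors' implementation.

```python
# Minimal sketch of Concerto's two training signals (assumptions, not the
# authors' code): (1) 3D intra-modal self-distillation between a student and
# an EMA teacher, (2) 2D-3D joint embedding pulling each point feature toward
# the pixel feature at its projected image location.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_feat, teacher_feat, temp_s=0.1, temp_t=0.05):
    """DINO-style cross-entropy between student logits and detached teacher logits."""
    t = F.softmax(teacher_feat.detach() / temp_t, dim=-1)
    log_s = F.log_softmax(student_feat / temp_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

def joint_embedding_loss(point_feat, pixel_feat):
    """Cosine alignment between point features and their matched 2D features."""
    return 1.0 - F.cosine_similarity(point_feat, pixel_feat.detach(), dim=-1).mean()

def training_step(student_3d, teacher_3d, encoder_2d, points_v1, points_v2,
                  images, point_to_pixel, lam=1.0):
    """Hypothetical combined step; all modules and the point-to-pixel
    correspondence (from camera projection) are placeholders."""
    s = student_3d(points_v1)              # (N, D) per-point student features
    with torch.no_grad():
        t = teacher_3d(points_v2)          # (N, D) EMA-teacher features
        pix = encoder_2d(images)           # (H*W, D) flattened per-pixel features
    pix_for_points = pix[point_to_pixel]   # (N, D) one pixel feature per point
    return self_distillation_loss(s, t) + lam * joint_embedding_loss(s, pix_for_points)
```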

Community

Paper author Paper submitter
•
edited 3 days ago

TL;DR: Concerto provides a jointly 2D-3D self-supervised pre-trained Point Transformer V3 for 3D point cloud downstream tasks, modified from Sonata. A hypothetical open-world usage sketch follows the links below.

Homepage: https://pointcept.github.io/Concerto/
Gradio Demo: https://huggingface.co/spaces/Pointcept/Concerto
Inference Code: https://github.com/Pointcept/Concerto
Training Code: https://github.com/Pointcept/Pointcept
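
To make the "translator into CLIP's language space" from the abstract concrete, below is a hedged, self-contained sketch of open-vocabulary point labeling: a linear layer maps point features into CLIP's embedding space, and each point takes the label of its most similar text prompt. Only the OpenAI `clip` calls are real API; `point_feat` (here random stand-ins for Concerto features), the feature dimension, and `translator` are placeholders. See the inference repo above for the actual interface.

```python
# Hedged sketch of open-world perception via a linear CLIP translator.
# The Concerto features and the translator below are random stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/16", device=device)

prompts = ["a chair", "a table", "a wall", "a floor"]
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = F.normalize(text_feat.float(), dim=-1)       # (4, 512)

# Stand-ins: in practice point_feat comes from the pre-trained Concerto
# backbone and translator is the linear projection trained by the authors.
num_points, feat_dim = 10_000, 512                           # hypothetical dims
point_feat = torch.randn(num_points, feat_dim, device=device)
translator = nn.Linear(feat_dim, text_feat.shape[-1]).to(device)

with torch.no_grad():
    z = F.normalize(translator(point_feat), dim=-1)          # (N, 512)
    labels = (z @ text_feat.T).argmax(dim=-1)                # per-point prompt index
print(labels.shape)  # torch.Size([10000])
```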


Thanks for the great work! 👍 👍 👍

Very cool paper. With world models, splatting, and really anything that has to operate in 3D space, I feel like there is a lot of information we can distill from 2D models that learn some sort of depth information, as evidenced by DINO and other SSL models already being decent depth estimators as is.
Here is a bite-sized podcast about the work in case anyone wants to listen to an AI overview: https://spotifycreators-web.app.link/e/qfVxqCBJRXb

·
Paper author

Thank you


Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 1

Collections including this paper 8