---
license: mit
pipeline_tag: image-to-3d
library_name: diffusers
---

<div align="center">

# ✨LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion✨

<p align="center">
<a href="https://liuff19.github.io/">Fangfu Liu</a><sup>1</sup>,
<a href="https://lifuguan.github.io/">Hao Li</a><sup>2</sup>,
<a href="https://github.com/chijw">Jiawei Chi</a><sup>1</sup>,
<a href="https://hanyang-21.github.io/">Hanyang Wang</a><sup>1,3</sup>,
<a href="https://github.com/liuff19/LangScene-X">Minghui Yang</a><sup>3</sup>,
<a href="https://github.com/liuff19/LangScene-X">Fudong Wang</a><sup>3</sup>,
<a href="https://duanyueqi.github.io/">Yueqi Duan</a><sup>1</sup>
<br>
<sup>1</sup>Tsinghua University, <sup>2</sup>NTU, <sup>3</sup>Ant Group
</p>
<h3 align="center">ICCV 2025 🔥</h3>
<a href="https://arxiv.org/abs/2507.02813"><img src='https://img.shields.io/badge/arXiv-2507.02813-b31b1b.svg'></a>
<a href="https://liuff19.github.io/LangScene-X"><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href="https://huggingface.co/chijw/LangScene-X"><img src='https://img.shields.io/badge/LangSceneX-huggingface-yellow'></a>
<a><img src='https://img.shields.io/badge/License-MIT-blue'></a>

![teaser](https://github.com/liuff19/LangScene-X/raw/main/assets/teaser.png)
</div>

**LangScene-X:** We propose LangScene-X, a unified model that generates RGB, segmentation, and normal maps, enabling 3D field reconstruction from sparse input views.

## 📄 Paper

The model was presented in the paper [LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion](https://huggingface.co/papers/2507.02813).

## 🔗 Links

- Repository: [https://github.com/liuff19/LangScene-X/](https://github.com/liuff19/LangScene-X/)
- Project Page: [https://liuff19.github.io/LangScene-X/](https://liuff19.github.io/LangScene-X/)
- arXiv: [https://arxiv.org/abs/2507.02813](https://arxiv.org/abs/2507.02813)

## 📖 Abstract

Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability.

## 📢 News

- 🔥 [04/07/2025] We release "LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion". Check our [project page](https://liuff19.github.io/LangScene-X) and [arXiv paper](https://arxiv.org/abs/2507.02813).

## 🌟 Pipeline

![pipeline](https://github.com/liuff19/LangScene-X/raw/main/assets/pipeline.png)

Pipeline of LangScene-X. The model comprises three parts: a TriMap video diffusion model that generates RGB, segmentation-map, and normal-map videos; an autoencoder that compresses the language features; and a field constructor that reconstructs 3DGS from the generated videos.

## 🎨 Video Demos from TriMap Video Diffusion

https://github.com/user-attachments/assets/55346d53-eb04-490e-bb70-64555e97e040

https://github.com/user-attachments/assets/d6eb28b9-2af8-49a7-bb8b-0d4cba7843a5

https://github.com/user-attachments/assets/396f11ef-85dc-41de-882e-e249c25b9961

## ⚙️ Setup

### 1. Clone Repository

```bash
git clone https://github.com/liuff19/LangScene-X.git
cd LangScene-X
```

### 2. Environment Setup

1. **Create conda environment**

```bash
conda create -n langscenex python=3.10 -y
conda activate langscenex
```

2. **Install dependencies**

```bash
conda install pytorch torchvision -c pytorch -y
pip install -e field_construction/submodules/simple-knn
pip install -e field_construction/submodules/diff-langsurf-rasterizer
pip install -e auto-seg/submodules/segment-anything-1
pip install -e auto-seg/submodules/segment-anything-2
pip install -r requirements.txt
```
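
The `pip install -e` steps above only work if the four listed directories are present and populated in your checkout. The following is a minimal sketch of a pre-flight check, meant to be run from the repository root. The function name `check_submodules` is hypothetical and not part of the repository. If a directory is reported as missing, one possible cause is that it is a git submodule that was not fetched (in which case `git submodule update --init --recursive` may help), but that is an assumption this README does not confirm.

```bash
#!/usr/bin/env bash
# Report which directories used by the `pip install -e` steps are
# missing or empty. Run from the repository root.
# NOTE: `check_submodules` is a hypothetical helper, not part of LangScene-X.
check_submodules() {
  local missing=0 dir
  for dir in \
    field_construction/submodules/simple-knn \
    field_construction/submodules/diff-langsurf-rasterizer \
    auto-seg/submodules/segment-anything-1 \
    auto-seg/submodules/segment-anything-2; do
    # An absent or empty directory means the editable install would fail.
    if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
      echo "missing or empty: $dir"
      missing=1
    fi
  done
  # Exit status 0 if everything is present, 1 otherwise.
  return "$missing"
}
```

Calling `check_submodules && echo "all submodule directories present"` before the install step gives a quick yes/no answer without modifying anything.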

### 3. Model Checkpoints
Checkpoints for SAM, SAM2, and the fine-tuned CogVideoX can be downloaded from our [Hugging Face repository](https://huggingface.co/chijw/LangScene-X).
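
One way to fetch the whole checkpoint repository is the `huggingface-cli download` command from the `huggingface_hub` package. The sketch below wraps it in a small function. The function name `download_checkpoints` and the default target directory `checkpoints` are assumptions made for illustration: this README does not state where the scripts expect the checkpoints, so check the repository before relying on that path.

```bash
#!/usr/bin/env bash
# Download all files from the chijw/LangScene-X model repository.
# NOTE: `download_checkpoints` and the default "checkpoints" directory are
# hypothetical; the expected checkpoint location is not stated in this README.
download_checkpoints() {
  local target_dir="${1:-checkpoints}"
  if ! command -v huggingface-cli >/dev/null 2>&1; then
    echo 'huggingface-cli not found; try: pip install -U "huggingface_hub[cli]"' >&2
    return 1
  fi
  huggingface-cli download chijw/LangScene-X --local-dir "$target_dir"
}
```

For example, `download_checkpoints` uses the default directory, and `download_checkpoints /data/langscenex` writes to a path of your choice.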

## 💻 Running

### Quick Start
To get started quickly, run the following script:
```bash
chmod +x quick_start.sh
./quick_start.sh <first_rgb_image_path> <last_rgb_image_path>
```
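
Because `quick_start.sh` takes exactly two image paths, a mistyped path is an easy way to start a long run that fails late. The following is a small sketch of a wrapper that validates both arguments first. The name `run_quick_start` is hypothetical, and the wrapper assumes it is called from the repository root after the `chmod +x` step above.

```bash
#!/usr/bin/env bash
# Validate the two input images before handing them to quick_start.sh.
# NOTE: `run_quick_start` is a hypothetical helper, not part of LangScene-X.
run_quick_start() {
  if [ "$#" -ne 2 ]; then
    echo "usage: run_quick_start <first_rgb_image_path> <last_rgb_image_path>" >&2
    return 2
  fi
  local img
  for img in "$1" "$2"; do
    # Stop early if either image is not a regular file.
    if [ ! -f "$img" ]; then
      echo "image not found: $img" >&2
      return 1
    fi
  done
  ./quick_start.sh "$1" "$2"
}
```

The wrapper returns 2 on a wrong argument count and 1 on a missing file, and otherwise passes both paths through unchanged.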

### Render
Run the following command to render from the reconstructed 3DGS field:
```bash
python entry_point.py \
    pipeline.rgb_video_path="does/not/matter" \
    pipeline.normal_video_path="does/not/matter" \
    pipeline.seg_video_path="does/not/matter" \
    pipeline.data_path="does/not/matter" \
    gaussian.dataset.source_path="does/not/matter" \
    gaussian.dataset.model_path="output/path" \
    pipeline.selection=False \
    gaussian.opt.max_geo_iter=1500 \
    gaussian.opt.normal_optim=True \
    gaussian.opt.optim_pose=True \
    pipeline.skip_video_process=True \
    pipeline.skip_lang_feature_extraction=True \
    pipeline.mode="render"
```
You can also configure these options by editing `configs/field_construction.yaml`.
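
In the render command, only `gaussian.dataset.model_path` points at real data, while the other path arguments are placeholders. A sketch of a helper that takes the model path as its single argument and keeps the remaining overrides exactly as listed could look like this. The function name `render_field` and the `DRY_RUN` variable are hypothetical additions for illustration, not options of `entry_point.py`.

```bash
#!/usr/bin/env bash
# Render from an already reconstructed field, varying only the model path.
# NOTE: `render_field` and DRY_RUN are hypothetical; with DRY_RUN=1 the
# command is printed instead of executed.
render_field() {
  if [ "$#" -ne 1 ]; then
    echo "usage: render_field <model_path>" >&2
    return 2
  fi
  # Overrides copied from the render command; only model_path is variable.
  local args=(
    pipeline.rgb_video_path="does/not/matter"
    pipeline.normal_video_path="does/not/matter"
    pipeline.seg_video_path="does/not/matter"
    pipeline.data_path="does/not/matter"
    gaussian.dataset.source_path="does/not/matter"
    gaussian.dataset.model_path="$1"
    pipeline.selection=False
    gaussian.opt.max_geo_iter=1500
    gaussian.opt.normal_optim=True
    gaussian.opt.optim_pose=True
    pipeline.skip_video_process=True
    pipeline.skip_lang_feature_extraction=True
    pipeline.mode="render"
  )
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo python entry_point.py "${args[@]}"
  else
    python entry_point.py "${args[@]}"
  fi
}
```

Running `DRY_RUN=1 render_field output/path` prints the assembled command so it can be inspected before launching the actual render.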

## 🔗 Acknowledgement

We thank the authors of the following great works, which we built on when implementing LangScene-X:

- [CogVideoX](https://github.com/THUDM/CogVideo), [CogvideX-Interpolation](https://github.com/feizc/CogvideX-Interpolation), [LangSplat](https://github.com/minghanqin/LangSplat), [LangSurf](https://github.com/lifuguan/LangSurf), [VGGT](https://github.com/facebookresearch/vggt), [3DGS](https://github.com/graphdeco-inria/gaussian-splatting), [SAM2](https://github.com/facebookresearch/sam2)

## 📚 Citation

```bibtex
@misc{liu2025langscenexreconstructgeneralizable3d,
      title={LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion},
      author={Fangfu Liu and Hao Li and Jiawei Chi and Hanyang Wang and Minghui Yang and Fudong Wang and Yueqi Duan},
      year={2025},
      eprint={2507.02813},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.02813},
}
```