---
library_name: transformers
license: apache-2.0
pipeline_tag: any-to-any
frameworks:
- Pytorch
tasks:
- any-to-any
---

## News
- **July 11, 2025**: **[Nexus-Gen V2](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2) is released**. Please see the [technical report](http://arxiv.org/abs/2504.21356) for more details. The model is optimized in the following aspects:
  - Better image understanding capability (**45.7 on [MMMU](https://github.com/MMMU-Benchmark/MMMU)**) through an optimized training schedule.
  - More robust image generation (**0.81 on [GenEval](https://github.com/djghosh13/geneval.git)**) through training with both long and short captions.
  - Better reconstruction in image editing tasks, enabled by an improved editing decoder for Nexus-Gen.
  - Support for generation and editing with Chinese prompts.
- **May 27, 2025**: We fine-tuned Nexus-Gen on the [BLIP-3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k) dataset, significantly improving the model's robustness to text prompts in image generation and **achieving a GenEval score of 0.79**. The [model checkpoints](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-Gen) have been updated.

## What is Nexus-Gen
Nexus-Gen is a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. We propose a unified image embedding space to model image understanding, generation, and editing tasks. To perform joint optimization across multiple tasks, we curate a large-scale dataset of 26.3 million samples and train Nexus-Gen with a multi-stage strategy, which includes multi-task pretraining of the autoregressive model and conditional adaptation of the generation and editing decoders.

For more information, please refer to our repo: https://github.com/modelscope/Nexus-Gen.git

![architecture](assets/illustrations/architecture.jpg)

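In code terms, inference is a two-stage pipeline: the autoregressive model predicts a sequence of image embeddings in the shared embedding space, and a diffusion-based vision decoder renders pixels conditioned on them. The sketch below illustrates only this data flow; `predict_image_embeddings` and `decode` are hypothetical placeholder names, not APIs from this repo.

```python
# Illustrative sketch of the two-stage Nexus-Gen inference flow.
# `predict_image_embeddings` and `decode` are hypothetical placeholders;
# see image_generation.py for the actual entry point.
def generate(autoregressive_model, generation_decoder, prompt: str):
    # Stage 1: the Qwen2.5-VL-based autoregressive model maps the prompt
    # to a sequence of continuous image embeddings.
    image_embeds = autoregressive_model.predict_image_embeddings(prompt)
    # Stage 2: the FLUX.1-Dev-based generation decoder synthesizes an image
    # conditioned on those embeddings.
    return generation_decoder.decode(image_embeds)
```
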
## Model Inference
### Installation
```shell
# 1. Install DiffSynth-Studio (https://github.com/modelscope/DiffSynth-Studio.git) from source
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .

# 2. Install requirements
pip install -r requirements.txt

# 3. Install ms-swift if you want to fine-tune Nexus-Gen.
pip install ms-swift==3.3.0.dev0
```

### Prepare models
Nexus-Gen adopts Qwen2.5-VL-Instruct 7B as its autoregressive model and FLUX.1-Dev as its vision decoders (the generation decoder and the editing decoder). Run the following script to download the checkpoints.
```shell
python download_models.py
```
### Image Understanding
Nexus-Gen inherits the image understanding ability of Qwen2.5-VL. Try the following script (requires at least 17 GB of VRAM).
```shell
python image_understanding.py --input_image assets/examples/cat.png --instruction "Please give a brief description of the image"
```

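If you want to caption several images in one go, you can loop over the CLI above. A minimal sketch, assuming it is run from the repository root with the flags documented in this README:

```python
# Batch image understanding by invoking the CLI above once per image.
# Note: each invocation reloads the model, so this is simple rather than fast.
import subprocess
from pathlib import Path

for image in sorted(Path("assets/examples").glob("*.png")):
    subprocess.run(
        [
            "python", "image_understanding.py",
            "--input_image", str(image),
            "--instruction", "Please give a brief description of the image",
        ],
        check=True,  # stop if the script fails on any image
    )
```
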
### Image Generation
Try the following script to perform image generation (requires at least 24 GB of VRAM). Please see `image_generation.py` for details about the inference hyperparameters.
```shell
python image_generation.py --prompt "A cute cat" --width 512 --height 512
```
Nexus-Gen V2 supports generation with Chinese prompts. Switch to the Chinese template by passing `--language zh`, as follows.
```shell
python image_generation.py --prompt "一只可爱的猫" --language zh --width 1024 --height 1024
```
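
To render several prompts at different resolutions, you can drive the same CLI from a short script. A minimal sketch, using only the flags documented above:

```python
# Generate a small batch of images with the CLI documented above.
import subprocess

jobs = [
    {"prompt": "A cute cat", "width": 512, "height": 512},
    # `--language zh` selects the Chinese template, as documented above.
    {"prompt": "一只可爱的猫", "width": 1024, "height": 1024, "language": "zh"},
]
for job in jobs:
    cmd = [
        "python", "image_generation.py",
        "--prompt", job["prompt"],
        "--width", str(job["width"]),
        "--height", str(job["height"]),
    ]
    if "language" in job:
        cmd += ["--language", job["language"]]
    subprocess.run(cmd, check=True)
```
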
### Image Editing
The Nexus-Gen model comprises two decoders: a generation decoder and an editing decoder (recommended). The former directly uses the 81-dimensional embeddings output by the autoregressive model to generate images, while the latter additionally incorporates the original image's 324-dimensional embeddings, enabling more accurate reconstruction of unedited regions.

Try the following script to perform image editing with the editing decoder.
```shell
python image_editing.py --input_image assets/examples/cat.png --instruction "Add a pair of sunglasses"
```

For large-region edits such as conceptual modifications, we recommend the generation decoder, which lets the model's image generation capability directly contribute to its editing performance. Try the following script to perform image editing with the generation decoder.
```shell
python image_editing.py --input_image assets/examples/cat.png --instruction "The cat is now running in a forest." --use_generation_decoder
```

Nexus-Gen also supports image editing with Chinese prompts:
```shell
python image_editing.py --input_image assets/examples/cat.png --instruction "给猫加一副太阳镜"
```
Please see `image_editing.py` for details about the inference hyperparameters.
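
The decoder choice can be folded into a small helper that follows the recommendation above: the editing decoder for local edits, the generation decoder for large-region or conceptual ones. A minimal sketch over the documented CLI:

```python
# Route an edit to the recommended decoder, then invoke the CLI above.
import subprocess

def edit_image(input_image: str, instruction: str, large_region: bool = False):
    cmd = [
        "python", "image_editing.py",
        "--input_image", input_image,
        "--instruction", instruction,
    ]
    if large_region:
        # Large-region / conceptual edits use the generation decoder.
        cmd.append("--use_generation_decoder")
    subprocess.run(cmd, check=True)

edit_image("assets/examples/cat.png", "Add a pair of sunglasses")
edit_image("assets/examples/cat.png", "The cat is now running in a forest.", large_region=True)
```
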
### Gradio demo
Try Nexus-Gen with a Gradio UI:
```shell
python app.py
```

## Model training
We train Nexus-Gen with a multi-stage strategy, which includes multi-task pretraining of the autoregressive model and conditional adaptation of the generation and editing decoders. The unified message-like dataset format is:
```json
{
  "images": ["xxx.jpg", "xxx.jpg"],
  "messages": [
    {"role": "user", "content": "<image> xxx"},
    {"role": "assistant", "content": "xxx"},
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx <image>"}
  ]
}
```
See `assets/example_datasets` for more examples.
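
To build your own dataset in this format, write one such record per line to a `.jsonl` file. A minimal sketch; the image paths and texts below are placeholders:

```python
# Append one training record in the message-like format shown above.
import json

record = {
    "images": ["source.jpg", "target.jpg"],  # placeholder paths
    "messages": [
        {"role": "user", "content": "<image> Add a pair of sunglasses"},
        {"role": "assistant", "content": "Here is the edited image. <image>"},
    ],
}
with open("my_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
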
### 1. Multi-task pretraining for autoregressive model
The autoregressive model of Nexus-Gen is trained on image understanding, generation, and editing tasks using the [ms-swift](https://github.com/modelscope/ms-swift.git) framework. Please refer to `assets/example_datasets/llm_dataset.jsonl` for an example dataset.

Run the following script to fine-tune Nexus-Gen V2. Refer to the script for more configurations.
```shell
bash train/scripts/train_autoregressive_model.sh
```

If you would like to train the autoregressive model from scratch, replace the checkpoints of Nexus-Gen V2 with those of Qwen2.5-VL-7B-Instruct. Specifically, replace the `*.safetensors` files and `models/Nexus-GenV2/model.safetensors.index.json`.
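
One way to script that swap, assuming both model directories already exist locally (see the `modelscope download` command later in this README) and follow the layout used elsewhere in this document:

```python
# Sketch: swap Nexus-Gen V2 weights for Qwen2.5-VL-7B-Instruct weights
# so training starts from the base model. Paths assume the layout used
# elsewhere in this README.
import shutil
from pathlib import Path

src = Path("models/Qwen/Qwen2.5-VL-7B-Instruct")
dst = Path("models/Nexus-GenV2")

for old in dst.glob("*.safetensors"):
    old.unlink()  # drop the Nexus-Gen V2 shards
for shard in src.glob("*.safetensors"):
    shutil.copy2(shard, dst / shard.name)
shutil.copy2(src / "model.safetensors.index.json", dst / "model.safetensors.index.json")
```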

### 2. Conditional adaptation for generation decoder
The generation decoder is trained via image reconstruction from the 81-token image embeddings. Training involves two steps.

(1) Prepare the embedding-image dataset: given the message-like dataset `assets/example_datasets/gen_decoder_dataset.jsonl`, run the following command to pre-compute the embeddings for each image and produce the embedding dataset `assets/example_datasets/embeds_gen/gen_decoder_embeds_dataset.jsonl`.
```shell
python train/utils/prepare_embeddataset_for_gen.py
```
(2) Train the generation decoder: run the following script.
```shell
bash train/scripts/train_generation_decoder.sh
```
Please refer to `train/configs/generation_decoder.yaml` for detailed configurations.

### 3. Conditional adaptation for editing decoder
The editing decoder is trained on the ImagePulse dataset. Training involves two steps.

(1) Prepare the embedding-image dataset: given the message-like dataset `assets/example_datasets/edit_decoder_dataset.jsonl`, run the following command to pre-compute the embeddings for the source and target images and produce the embedding dataset `assets/example_datasets/embeds_edit/edit_decoder_embeds_dataset.jsonl`.
```shell
PYTHONPATH=$(pwd) python train/utils/prepare_embeddataset_for_edit.py
```
(2) Train the editing decoder: run the following script.
```shell
bash train/scripts/train_editing_decoder.sh
```
Please refer to `train/configs/editing_decoder.yaml` for detailed configurations. Note that the editing decoder's projector includes a transformer layer initialized from Qwen2.5-VL-7B-Instruct, so it is necessary to download those checkpoints to `models/Qwen/Qwen2.5-VL-7B-Instruct`:
```shell
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir models/Qwen/Qwen2.5-VL-7B-Instruct
```
## Training Datasets
To be published.

## Qualitative results of Nexus-Gen
![cover](assets/illustrations/gen_edit.jpg)

## Limitations
- Please note that Nexus-Gen was trained on limited text-to-image data and may not be robust to text prompts.

## Citation
```bibtex
@misc{zhang2025nexusgenunifiedimageunderstanding,
      title={Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space},
      author={Hong Zhang and Zhongjie Duan and Xingjun Wang and Yuze Zhao and Weiyi Lu and Zhipeng Di and Yixuan Xu and Yingda Chen and Yu Zhang},
      year={2025},
      eprint={2504.21356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.21356},
}
```