---
library_name: transformers
license: apache-2.0
pipeline_tag: any-to-any
frameworks:
- Pytorch
tasks:
- any-to-any
---
## News
- **July 11, 2025**: **[Nexus-Gen V2](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2) is released**. Please see the [technical report](http://arxiv.org/abs/2504.21356) for more details. The model is optimized in the following aspects:
    - Better image understanding capability (**45.7 on [MMMU](https://github.com/MMMU-Benchmark/MMMU)**) through an optimized training schedule.
    - Better image generation robustness (**0.81 on [GenEval](https://github.com/djghosh13/geneval.git)**) through training with both long and short captions.
    - Better reconstruction in image editing tasks, enabled by an improved editing decoder for Nexus-Gen.
    - Support for image generation and editing with Chinese prompts.
- **May 27, 2025**: We fine-tuned Nexus-Gen using the [BLIP-3o-60k](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k) dataset, significantly improving the model's robustness to text prompts in image generation, **achieving a GenEval score of 0.79**. The [model checkpoints](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-Gen) have been updated.
## What is Nexus-Gen
Nexus-Gen is a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. We propose a unified image embedding space to model image understanding, generation and editing tasks. To perform joint optimization across multiple tasks, we curate a large-scale dataset of 26.3 million samples and train Nexus-Gen using a multi-stage strategy, which includes the multi-task pretraining of the autoregressive model and conditional adaptations of the generation and editing decoders.
For more information, please refer to our repo: https://github.com/modelscope/Nexus-Gen.git
![architecture](assets/illustrations/architecture.jpg)
## Model Inference
### Installation
```shell
# 1. Install DiffSynth-Studio (https://github.com/modelscope/DiffSynth-Studio.git) from source
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
# 2. Install requirements
pip install -r requirements.txt
# 3. Install ms-swift if you want to perform finetuning on Nexus-Gen.
pip install ms-swift==3.3.0.dev0
```
### Prepare models
Nexus-Gen adopts Qwen2.5-VL-7B-Instruct as its autoregressive model and FLUX.1-Dev as the base of its vision decoders (the generation decoder and the editing decoder). Run the following script to download the checkpoints.
```shell
python download_models.py
```
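If you prefer to fetch the checkpoints manually, the snippet below is a minimal sketch using ModelScope's `snapshot_download` API; the `DiffSynth-Studio/Nexus-GenV2` model ID and the `models/Nexus-GenV2` target directory follow the links and paths used elsewhere in this README. `download_models.py` remains the recommended route, since it fetches every checkpoint the inference scripts expect.
```python
# Minimal sketch: manually download the Nexus-Gen V2 checkpoints via ModelScope.
# The local_dir argument assumes a reasonably recent modelscope release.
from modelscope import snapshot_download

model_dir = snapshot_download(
    "DiffSynth-Studio/Nexus-GenV2",   # model ID from the ModelScope page linked above
    local_dir="models/Nexus-GenV2",   # directory referenced by the training scripts below
)
print(f"Checkpoints downloaded to {model_dir}")
```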
### Image Understanding
Nexus-Gen inherits the image understanding capability of Qwen2.5-VL. Try the following script (requires at least 17 GB of VRAM).
```shell
python image_understanding.py --input_image assets/examples/cat.png --instruction "Please give a brief description of the image"
```
### Image Generation
Try the following script to perform image generation (requires at least 24 GB of VRAM). Please see `image_generation.py` for details about the inference hyperparameters.
```shell
python image_generation.py --prompt "A cute cat" --width 512 --height 512
```
Nexus-Gen V2 supports generation with Chinese prompts. You can further switch to the Chinese prompt template by passing `--language zh`, as follows.
```shell
python image_generation.py --prompt "一只可爱的猫" --language zh --width 1024 --height 1024
```
### Image Editing
The Nexus-Gen model comprises two decoders: a generation decoder and an editing decoder (recommended). The former directly uses the 81-token image embeddings output by the autoregressive model to generate images, while the latter additionally incorporates the source image's 324-token embeddings, enabling more accurate reconstruction of unedited regions in the image.
Try the following script to perform image editing with the editing decoder.
```shell
python image_editing.py --input_image assets/examples/cat.png --instruction "Add a pair of sunglasses"
```
When performing large-region image edits such as conceptual modifications, it is recommended to use the generation decoder, which allows the model's image generation capabilities to directly enhance its editing performance. Try the following script to perform image editing with the generation decoder.
```shell
python image_editing.py --input_image assets/examples/cat.png --instruction "The cat is now running in a forest." --use_generation_decoder
```
Nexus-Gen also supports image editing using Chinese prompts:
```shell
python image_editing.py --input_image assets/examples/cat.png --instruction "给猫加一副太阳镜"
```
Please see `image_editing.py` for details about the inference hyperparameters.
### Gradio demo
Try Nexus-Gen with a Gradio UI:
```shell
python app.py
```
## Model training
We train Nexus-Gen using a multi-stage strategy, which includes the multi-task pretraining of the autoregressive model and conditional adaptations of the generation and editing decoders. The unified message-like dataset format is:
```json
"images": ["xxx.jpg", "xxx.jpg"]
"messages": [
{"role": "user", "content": "<image> xxx"},
{"role": "assistant", "content": "xxx"},
{"role": "user", "content": "xxx"},
{"role": "assistant", "content": "xxx <image>"}
]
```
See `assets/example_datasets` for more examples.
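As an illustration, the sketch below writes one text-to-image sample and one editing-style sample in this message-like format to a JSONL file. The file name, image paths, and conversation contents are hypothetical; check the example datasets above for the exact conventions used in training.
```python
# Illustrative sketch only: build two samples in the message-like format and write them as JSONL.
import json

samples = [
    {   # text-to-image generation: the assistant reply ends with an <image> placeholder
        "images": ["data/cat.jpg"],
        "messages": [
            {"role": "user", "content": "Generate an image of a cute cat."},
            {"role": "assistant", "content": "Here is a cute cat. <image>"},
        ],
    },
    {   # image editing: the user turn references the source image, the reply carries the target image
        "images": ["data/cat.jpg", "data/cat_sunglasses.jpg"],
        "messages": [
            {"role": "user", "content": "<image> Add a pair of sunglasses to the cat."},
            {"role": "assistant", "content": "Done. <image>"},
        ],
    },
]

with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```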
### 1. Multi-task pretraining for autoregressive model
The autoregressive model of Nexus-Gen is trained on image understanding, generation and editing tasks using the [ms-swift](https://github.com/modelscope/ms-swift.git) framework. Please refer to `assets/example_datasets/llm_dataset.jsonl` for the example dataset.
Run the following script to perform finetuning on Nexus-Gen V2. Refer to the script for more configurations.
```shell
bash train/scripts/train_autoregressive_model.sh
```
If you would like to train the autoregressive model from scratch, simply replace the checkpoints of Nexus-Gen V2 with those of Qwen2.5-VL-7B-Instruct. Specifically, replace the `*.safetensors` files and `models/Nexus-GenV2/model.safetensors.index.json`.
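As a rough sketch, assuming the original checkpoints have been downloaded to `models/Qwen/Qwen2.5-VL-7B-Instruct` (see the editing decoder section below), the replacement could look like this; adapt the paths to your setup.
```python
# Rough sketch (paths are assumptions based on this README): overwrite the Nexus-Gen V2
# weights with the original Qwen2.5-VL-7B-Instruct checkpoints to start training from scratch.
import glob
import shutil

src = "models/Qwen/Qwen2.5-VL-7B-Instruct"
dst = "models/Nexus-GenV2"

# Copy the sharded weight files and the matching index file.
for path in glob.glob(f"{src}/*.safetensors"):
    shutil.copy(path, dst)
shutil.copy(f"{src}/model.safetensors.index.json", dst)
```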
### 2. Conditional adaptation for generation decoder
The generation decoder is trained via image reconstruction from the 81-token image embeddings. Training involves two steps.
(1) Prepare the embedding-image dataset: given the message-like dataset `assets/example_datasets/gen_decoder_dataset.jsonl`, run the following command to pre-compute the embeddings for each image and obtain the embedding dataset `assets/example_datasets/embeds_gen/gen_decoder_embeds_dataset.jsonl`.
```shell
python train/utils/prepare_embeddataset_for_gen.py
```
(2) Train the generation decoder: run the following script.
```shell
bash train/scripts/train_generation_decoder.sh
```
Please refer to `train/configs/generation_decoder.yaml` for detailed configurations.
### 3. Conditional adaptation for editing decoder
The editing decoder is trained on the ImagePulse dataset. Training involves two steps.
(1) Prepare the embedding-image dataset: given the message-like dataset `assets/example_datasets/edit_decoder_dataset.jsonl`, run the following command to pre-compute the embeddings for the source and target images and obtain the embedding dataset `assets/example_datasets/embeds_edit/edit_decoder_embeds_dataset.jsonl`.
```shell
PYTHONPATH=$(pwd) python train/utils/prepare_embeddataset_for_edit.py
```
(2) Train the editing decoder: run the following script.
```shell
bash train/scripts/train_editing_decoder.sh
```
Please refer to `train/configs/editing_decoder.yaml` for detailed configurations. Note that the projector of the editing decoder includes a transformer layer initialized from Qwen2.5-VL-7B-Instruct, so it is necessary to download the checkpoints to `models/Qwen/Qwen2.5-VL-7B-Instruct`:
```shell
modelscope download --model Qwen/Qwen2.5-VL-7B-Instruct --local_dir models/Qwen/Qwen2.5-VL-7B-Instruct
```
## Training Datasets
To be published.
## Qualitative results of Nexus-Gen
![cover](assets/illustrations/gen_edit.jpg)
## Limitations
- Please note that Nexus-Gen was trained on limited text-to-image data and may not be robust to text prompts.
## Citation
```
@misc{zhang2025nexusgenunifiedimageunderstanding,
title={Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space},
author={Hong Zhang and Zhongjie Duan and Xingjun Wang and Yuze Zhao and Weiyi Lu and Zhipeng Di and Yixuan Xu and Yingda Chen and Yu Zhang},
year={2025},
eprint={2504.21356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.21356},
}
```