This model is a BnB 4-bit pre-quantized version of the official https://huggingface.co/LanguageBind/UniWorld-V1 model, which greatly reduces its download size, storage footprint, and VRAM usage.

How to load: This repo loads the same way as the official original FP32 model. First confirm that the bitsandbytes dependency package is installed in your Python environment.
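Before loading, it can help to confirm that bitsandbytes is importable in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` is hypothetical and is not part of UniWorld or bitsandbytes.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["bitsandbytes"])
    if missing:
        # The install hint assumes a pip-managed environment.
        print("Missing packages:", ", ".join(missing), "(try: pip install bitsandbytes)")
    else:
        print("bitsandbytes is installed.")
```

The check only looks at top-level package names, so it does not verify that the installed bitsandbytes build matches your CUDA setup.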

UniWorld: High-Resolution Semantic Encoders for
Unified Visual Understanding and Generation

📣 News

  • [2025.06.03] 🤗 We release UniWorld, a unified framework for understanding, generation, and editing. All data, models, training code, and evaluation code are open-sourced. Check our report for more details. You are welcome to watch 👀 this repository for the latest updates.

😍 Gallery

UniWorld shows excellent performance in 20+ tasks.

UniWorld, trained on only 2.7M samples, consistently outperforms BAGEL (trained on 2665M samples) on the ImgEdit-Bench for image manipulation. It also surpasses the specialized image editing model Step1X-Edit across multiple dimensions, including add, adjust, and extract on ImgEdit-Bench.

😮 Highlights

1. All Resources Fully Open-Sourced

  • We fully open-source the models, data, training and evaluation code to facilitate rapid community exploration of unified architectures.

  • We curate 10+ CV downstream tasks, including canny, depth, sketch, MLSD, segmentation and so on.

  • We annotate 286K long-caption samples using Qwen2-VL-72B. We use GPT-4o to filter ImgEdit, resulting in 724K high-quality editing samples (all with short edge ≥ 1024 px). Additionally, we organize and filter existing open-source datasets. The details can be found here.

2. Contrastive Semantic Encoders as Reference Control Signals

  • Unlike prior approaches that use VAE-encoded reference images for low-level control, we advocate using contrastive visual encoders as control signals for reference images.

  • For such encoders, we observe that as resolution increases, global features approach saturation and model capacity shifts toward preserving fine details, which is crucial for maintaining fidelity in non-edited regions.

3. Image Priors via VLM Encoding Without Learnable Tokens

  • We find that multimodal features encoded by VLMs can interpret instructions while retaining image priors. Due to causal attention, the format <instruction><image> is particularly important.
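One way to read the causal-attention remark is that a token can attend only to itself and to earlier positions, so image tokens can take the instruction into account only if the instruction comes first. The sketch below illustrates that reading with segment names instead of real tokens; `visible_segments` is a hypothetical helper and is not UniWorld code.

```python
def visible_segments(order):
    """For each segment in a causally masked sequence, list the segments it can attend to.

    Under causal attention a token attends only to itself and earlier positions,
    so a segment sees itself and every segment placed before it.
    """
    return {name: order[: i + 1] for i, name in enumerate(order)}


instruction_first = visible_segments(["instruction", "image"])
image_first = visible_segments(["image", "instruction"])

# With <instruction><image>, image tokens can read the instruction.
print(instruction_first["image"])  # ['instruction', 'image']
# With <image><instruction>, image tokens see only the image itself.
print(image_first["image"])  # ['image']
```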

🤗 Demo

Gradio Web UI

We highly recommend trying our web demo with the following command.

MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=0 python -m univa.serve.gradio_web_server \
    --model_path ${MODEL_PATH} \
    --flux_path ${FLUX_PATH} \
    --siglip_path ${SIGLIP_PATH}

CLI Inference

MODEL_PATH="path/to/model"
FLUX_PATH="path/to/flux"
SIGLIP_PATH="path/to/siglip"
CUDA_VISIBLE_DEVICES=1 python -m univa.serve.cli \
    --model_path ${MODEL_PATH} \
    --flux_path ${FLUX_PATH} \
    --siglip_path ${SIGLIP_PATH}
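Both launch commands share the same three flags, so they can be assembled from one helper when scripting. The following is a minimal sketch; `build_serve_command` is a hypothetical name, and only the module names and flags already shown in the commands are used.

```python
import shlex


def build_serve_command(module, model_path, flux_path, siglip_path):
    """Build the argument list for a UniWorld serve module using the documented flags."""
    return [
        "python", "-m", module,
        "--model_path", model_path,
        "--flux_path", flux_path,
        "--siglip_path", siglip_path,
    ]


if __name__ == "__main__":
    for module in ("univa.serve.gradio_web_server", "univa.serve.cli"):
        cmd = build_serve_command(module, "path/to/model", "path/to/flux", "path/to/siglip")
        # shlex.join quotes arguments for display; pass `cmd` to subprocess.run to execute it.
        print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell-quoting problems when a path contains spaces.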

ComfyUI

Coming soon...

⚙️ Requirements and Installation

  1. Clone this repository and navigate to the UniWorld folder
git clone https://github.com/PKU-YuanGroup/UniWorld
cd UniWorld
  2. Install required packages
conda create -n univa python=3.10 -y
conda activate univa
pip install -r requirements.txt

🗝️ Training

Data preparation

Download the data from LanguageBind/UniWorld-V1. The dataset consists of two parts: source images and annotation JSON files.

Prepare a data.txt file in the following format:

  1. The first column is the root path to the image.

  2. The second column is the corresponding annotation JSON file.

  3. The third column indicates whether to enable the region-weighting strategy. We recommend setting it to true for editing data and false for all other data.

data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/coco2017_caption_canny-236k,coco2017_canny_236574.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
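The three-column format can be parsed with a few lines of Python. This is a minimal sketch, not the repository's `check_data.py`; `DataEntry` and `parse_data_txt` are hypothetical names, and the parser assumes that paths contain no commas.

```python
from typing import NamedTuple


class DataEntry(NamedTuple):
    image_root: str
    annotation_json: str
    region_weighting: bool


def parse_data_txt(text):
    """Parse data.txt content into entries, one per non-empty line."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 comma-separated columns, got {len(parts)}")
        flag = parts[2].lower()
        if flag not in ("true", "false"):
            raise ValueError(f"line {line_no}: third column must be true or false, got {parts[2]!r}")
        entries.append(DataEntry(parts[0], parts[1], flag == "true"))
    return entries


sample = """\
data/BLIP3o-60k,json/blip3o_t2i_58859.json,false
data/imgedit,json/imgedit/laion_add_part0_edit.json,true
"""
for entry in parse_data_txt(sample):
    print(entry)
```

The sketch checks only the line format; it does not check that the image root or the JSON file exists on disk.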

We provide a simple online verification tool to check whether your paths are set in data.txt correctly.

python univa/serve/check_data.py

Data details

Text-to-Image Generation

  • BLIP3o-60k: We add text-to-image instructions to half of the data. [108 GB storage usage.]
  • OSP1024-286k: Sourced from internal data of the Open-Sora Plan, with captions generated using Qwen2-VL-72B. Images have an aspect ratio between 3:4 and 4:3, aesthetic score ≥ 6, and a short side ≥ 1024 pixels. [326 GB storage usage.]
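The OSP1024-286k criteria can be written as a single predicate. The sketch below is illustrative only and is not the authors' filtering code; `passes_osp_filter` is a hypothetical name, and how the aesthetic score is computed is outside its scope.

```python
def passes_osp_filter(width, height, aesthetic_score):
    """Check the stated criteria: aspect ratio within [3:4, 4:3], score >= 6, short side >= 1024."""
    # Integer cross-multiplication avoids floating-point issues at the ratio boundaries:
    # width/height >= 3/4  <=>  4*width >= 3*height, and width/height <= 4/3  <=>  3*width <= 4*height.
    ratio_ok = 4 * width >= 3 * height and 3 * width <= 4 * height
    return ratio_ok and aesthetic_score >= 6 and min(width, height) >= 1024


print(passes_osp_filter(1600, 1200, 6.3))  # True: 4:3, short side 1200
print(passes_osp_filter(2048, 1024, 7.0))  # False: 2:1 is outside the allowed range
```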

Image Editing

  • imgedit-724k: Data is filtered using GPT-4o, retaining approximately half. [2.1 TB storage usage.]
  • OmniEdit-368k: Image editing data. Samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [204 GB storage usage.]
  • SEED-Data-Edit-Part1-Openimages-65k: Image editing data. Samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [10 GB storage usage.]
  • SEED-Data-Edit-Part2-3-12k: Image editing data. Samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [10 GB storage usage.]
  • PromptfixData-18k: Image restoration data and some editing data. Samples with edited regions smaller than 1/100 were filtered out; images have a short side ≥ 1024 pixels. [9 GB storage usage.]
  • StyleBooth-11k: Style-transfer data. Images have a short side ≥ 1024 pixels. [4 GB storage usage.]
  • Ghibli-36k: Style-transfer data. Images have a short side ≥ 1024 pixels. Warning: This data has not been quality filtered. [170 GB storage usage.]
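For the entries that state both an edited-region threshold and a short-side threshold (for example OmniEdit-368k), the rule can be sketched as follows. This is an illustration, not the authors' code: `keep_edit_sample` is a hypothetical name, and it assumes that "1/100" means a fraction of the total image area and that the edited region is given as a pixel count.

```python
def keep_edit_sample(width, height, edited_pixels, min_short_side=1024, min_fraction_denominator=100):
    """Keep a sample if its short side is large enough and the edited region is not too small.

    Assumption: the edited region is measured in pixels and compared against
    1/min_fraction_denominator of the total image area.
    """
    short_side_ok = min(width, height) >= min_short_side
    # edited_pixels / (width * height) >= 1/100, written with integers only.
    region_ok = edited_pixels * min_fraction_denominator >= width * height
    return short_side_ok and region_ok


print(keep_edit_sample(1024, 1024, 20000))  # True: about 1.9% of the image was edited
print(keep_edit_sample(1024, 1024, 5000))   # False: under 1% of the image was edited
```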

Extract & Try-on

  • viton_hd-23k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
  • deepfashion-27k: Converted from the source data into an instruction dataset for product extraction. [1 GB storage usage.]
  • shop_product-23k: Sourced from internal data of the Open-Sora Plan, focusing on product extraction and virtual try-on, with images having a short side ≥ 1024 pixels. [12 GB storage usage.]

Image Perception

Training

Prepare pretrained weights

Download black-forest-labs/FLUX.1-dev to $FLUX_PATH. Download Qwen/Qwen2.5-VL-7B-Instruct to $QWENVL_PATH. We also support other sizes of Qwen2.5-VL.

SAVE_PATH="path/to/save/UniWorld-Qwen2.5-VL-7B-Instruct-FLUX.1-dev-fp32"
python scripts/make_univa_qwen2p5vl_weight.py \
    --origin_flux_ckpt_path $FLUX_PATH \
    --origin_qwenvl_ckpt_path $QWENVL_PATH \
    --save_path ${SAVE_PATH}
# stage1
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage1_512.sh

Download flux-redux-siglipv2-512.bin and set its path as pretrained_siglip_mlp_path in stage2.yaml. The weights are sourced from ostris/Flex.1-alpha-Redux; we only reorganized them.

# stage2
bash scripts/denoiser/flux_qwen2p5vl_7b_vlm_stage2_512.sh

⚡️ Evaluation

Text-to-Image Generation

GenEval

cd univa/eval/geneval
# follow the instruction in univa/eval/geneval/README.md

WISE

cd univa/eval/wise
# follow the instruction in univa/eval/wise/README.md

GenAI-Bench

cd univa/eval/genai
# follow the instruction in univa/eval/genai/README.md

DPG-Bench

cd univa/eval/dpgbench
# follow the instruction in univa/eval/dpgbench/README.md

Image Editing

ImgEdit

cd univa/eval/imgedit
# follow the instruction in univa/eval/imgedit/README.md

GEdit

cd univa/eval/gdit
# follow the instruction in univa/eval/gdit/README.md
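Each benchmark is driven by the README in its own directory, so a quick check that those files exist in your checkout can save a failed run. The following is a minimal sketch that uses only the directory names listed above; `EVAL_DIRS` and `find_missing_readmes` are hypothetical names and are not part of the repository.

```python
from pathlib import Path

# Benchmark name -> directory, exactly as listed in the evaluation commands.
EVAL_DIRS = {
    "GenEval": "univa/eval/geneval",
    "WISE": "univa/eval/wise",
    "GenAI-Bench": "univa/eval/genai",
    "DPG-Bench": "univa/eval/dpgbench",
    "ImgEdit": "univa/eval/imgedit",
    "GEdit": "univa/eval/gdit",
}


def find_missing_readmes(repo_root):
    """Return the benchmark names whose README.md is not found under repo_root."""
    root = Path(repo_root)
    return [name for name, rel in EVAL_DIRS.items() if not (root / rel / "README.md").is_file()]


if __name__ == "__main__":
    # Run from the root of a UniWorld checkout.
    missing = find_missing_readmes(".")
    print("Missing READMEs:", missing if missing else "none")
```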

📊 Benchmarks

💡 How to Contribute

We greatly appreciate your contributions to the UniWorld open-source community and your help in making it even better!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement and Related Work

  • ImgEdit: ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs.
  • Open-Sora Plan: An open-source text-to-image/video foundation model, which provides a lot of caption data.
  • SEED-Data-Edit: A hybrid dataset for instruction-guided image editing.
  • Qwen2.5-VL: The new flagship vision-language model of Qwen.
  • FLUX.1-Redux-dev: Given an input image, FLUX.1 Redux can reproduce the image with slight variation, which allows a given image to be refined.
  • SigLIP 2: New multilingual vision-language encoders.
  • Step1X-Edit: A state-of-the-art image editing model.
  • BLIP3-o: A unified multimodal model that combines the reasoning and instruction following strength of autoregressive models with the generative power of diffusion models.
  • BAGEL: An open-source multimodal foundation model with 7B active parameters (14B total) trained on large-scale interleaved multimodal data.

🔒 License

✏️ Citing

@misc{lin2025uniworldhighresolutionsemanticencoders,
      title={UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation}, 
      author={Bin Lin and Zongjian Li and Xinhua Cheng and Yuwei Niu and Yang Ye and Xianyi He and Shenghai Yuan and Wangbo Yu and Shaodong Wang and Yunyang Ge and Yatian Pang and Li Yuan},
      year={2025},
      eprint={2506.03147},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.03147}, 
}
@article{niu2025wise,
  title={Wise: A world knowledge-informed semantic evaluation for text-to-image generation},
  author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Zhu, Bin and Yuan, Li},
  journal={arXiv preprint arXiv:2503.07265},
  year={2025}
}
@article{lin2024open,
  title={Open-Sora Plan: Open-Source Large Video Generation Model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}

🤝 Community contributors

This model is presented in the paper: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
