Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment [ICLR 2025]
This repository contains the LoRA weights for the Hummingbird model, presented in Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment. The Hummingbird model generates high-quality, diverse images from a multimodal context, preserving scene attributes and object interactions from both a reference image and text guidance.
Official implementation of the paper: Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Prerequisites
Installation
- Clone this repository and navigate to hummingbird-1 folder
git clone https://github.com/roar-ai/hummingbird-1
cd hummingbird-1
- Create a conda virtual environment with Python 3.9 (PyTorch 2.0+ is recommended):
conda create -n hummingbird python=3.9
conda activate hummingbird
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
- Install additional packages for faster training and inference
pip install flash-attn --no-build-isolation
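After installation, a quick sanity check of the environment can save a failed run later. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `check_environment` are hypothetical, and treating Python 3.9 and PyTorch 2.0 as minimum versions is an assumption based on the recommendation above.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '2.4.1+cu124' into a tuple like (2, 4, 1)."""
    parts = []
    # Drop any local build suffix ('+cu124'), then read leading numeric fields.
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_environment(min_python=(3, 9), min_torch=(2, 0)):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = sys.version.split()[0]
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ expected, found {found}")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("torch is not installed in this environment")
    else:
        if parse_version(torch_version) < min_torch:
            problems.append(f"PyTorch {min_torch[0]}.{min_torch[1]}+ recommended, found {torch_version}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The check reads package metadata instead of importing `torch`, so it stays fast and does not require a GPU to run.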
Download necessary models
- Clone our Hummingbird LoRA weights for the UNet denoiser
git clone https://huggingface.co/lmquan/hummingbird
- Refer to stabilityai/stable-diffusion-xl-base-1.0 to download the SDXL pre-trained model and place it in the hummingbird weight directory as ./hummingbird/stable-diffusion-xl-base-1.0.
- Download laion/CLIP-ViT-bigG-14-laion2B-39B-b160k for the feature extractor and image encoder in the Hummingbird framework
cp -r CLIP-ViT-bigG-14-laion2B-39B-b160k ./hummingbird/stable-diffusion-xl-base-1.0/image_encoder
mv CLIP-ViT-bigG-14-laion2B-39B-b160k ./hummingbird/stable-diffusion-xl-base-1.0/feature_extractor
- Replace the model_index.json file of the pre-trained stable-diffusion-xl-base-1.0 with our customized version for the Hummingbird framework
cp -r ./hummingbird/model_index.json ./hummingbird/stable-diffusion-xl-base-1.0/
- Download HPSv2 weights and put them here: hpsv2/HPS_v2_compressed.pt.
- Download PickScore model weights and put them here: pickscore/pickmodel/model.safetensors.
Double-check that everything is in place:
|-- hummingbird-1/
    |-- hpsv2
        |-- HPS_v2_compressed.pt
    |-- pickscore
        |-- pickmodel
            |-- config.json
            |-- model.safetensors
    |-- hummingbird
        |-- model_index.json
        |-- lora_unet_65000
            |-- adapter_config.json
            |-- adapter_model.safetensors
        |-- stable-diffusion-xl-base-1.0
            |-- model_index.json (replaced by our customized version, see step 4 above)
            |-- feature_extractor (cloned from CLIP-ViT-bigG-14-laion2B-39B-b160k)
            |-- image_encoder (cloned from CLIP-ViT-bigG-14-laion2B-39B-b160k)
            |-- text_encoder
            |-- text_encoder_2
            |-- tokenizer
            |-- tokenizer_2
            |-- unet
            |-- vae
            |-- ...
    |-- ...
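The layout above can also be checked programmatically. This is a minimal sketch, assuming it is run from the hummingbird-1 folder; the `EXPECTED` list is transcribed from the tree, and the helper `missing_paths` is a hypothetical name, not something the repository ships.

```python
from pathlib import Path

# Relative paths transcribed from the directory tree above.
SDXL = "hummingbird/stable-diffusion-xl-base-1.0"
EXPECTED = [
    "hpsv2/HPS_v2_compressed.pt",
    "pickscore/pickmodel/config.json",
    "pickscore/pickmodel/model.safetensors",
    "hummingbird/model_index.json",
    "hummingbird/lora_unet_65000/adapter_config.json",
    "hummingbird/lora_unet_65000/adapter_model.safetensors",
    f"{SDXL}/model_index.json",
    f"{SDXL}/feature_extractor",
    f"{SDXL}/image_encoder",
    f"{SDXL}/text_encoder",
    f"{SDXL}/text_encoder_2",
    f"{SDXL}/tokenizer",
    f"{SDXL}/tokenizer_2",
    f"{SDXL}/unet",
    f"{SDXL}/vae",
]


def missing_paths(root="."):
    """Return the expected files and folders that do not exist under root."""
    root = Path(root)
    return [relative for relative in EXPECTED if not (root / relative).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing entries:")
        for relative in missing:
            print("  ", relative)
    else:
        print("All expected files and folders are present.")
```

The sketch only tests for existence. It does not verify that model_index.json is the customized version or that the weight files are complete.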
Quick Start
Given a reference image, Hummingbird can generate diverse variants of it and preserve specific properties/attributes, for example:
python3 inference.py --reference_image ./examples/image-2.jpg --attribute "color of skateboard wheels" --output_path output.jpg
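To produce several variants of one reference image, each preserving a different attribute, the command above can be wrapped in a small loop. The sketch below is an illustration under assumptions: it relies only on the three flags shown in the command (`--reference_image`, `--attribute`, `--output_path`), while the helper names, the `outputs` folder, and the output file naming are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(reference_image, attribute, output_path):
    """Assemble one inference.py call using only the flags shown in the README."""
    return [
        sys.executable, "inference.py",
        "--reference_image", str(reference_image),
        "--attribute", attribute,
        "--output_path", str(output_path),
    ]


def generate_variants(reference_image, attributes, output_dir="outputs", dry_run=True):
    """Build (and optionally run) one inference command per attribute."""
    output_dir = Path(output_dir)
    stem = Path(reference_image).stem
    commands = [
        # Hypothetical naming scheme: <image stem>_<attribute index>.jpg
        build_command(reference_image, attribute, output_dir / f"{stem}_{index}.jpg")
        for index, attribute in enumerate(attributes)
    ]
    if not dry_run:
        output_dir.mkdir(parents=True, exist_ok=True)
        for command in commands:
            # check=True stops the loop at the first failing run.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for command in generate_variants("./examples/image-2.jpg", ["color of skateboard wheels"]):
        print(" ".join(command))
```

With `dry_run=True` (the default here) nothing is executed, which makes it easy to inspect the commands before committing GPU time. Pass `dry_run=False` from the repository root to actually run them.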
Training
You can train Hummingbird with the following script:
sh run_hummingbird.sh
Synthetic Data Generation
You can generate synthetic data with the Hummingbird framework, e.g. with the MME Perception dataset:
python3 image_generation.py --generator hummingbird --dataset mme --save_image_gen ./synthetic_mme
Testing
Evaluate the fidelity of generated images with respect to the reference image using Test-Time Augmentation on MLLMs (LLaVA/InternVL2):
python3 test_hummingbird_mme.py --dataset mme --model llava --synthetic_dir ./synthetic_mme
Acknowledgement
Our code builds on the implementation of TextCraftor. We thank BLIP-2 QFormer, HPSv2, PickScore, and Aesthetic for the reward models, and the MLLMs LLaVA and InternVL2, which function as context descriptors in our framework.
Citation
If you find this work helpful, please cite our paper:
@inproceedings{le2025hummingbird,
title={Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment},
author={Minh-Quan Le and Gaurav Mittal and Tianjian Meng and A S M Iftekhar and Vishwas Suryanarayanan and Barun Patra and Dimitris Samaras and Mei Chen},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=6kPBThI6ZJ}
}