|
# Latent Diffusion Models |
|
[arXiv](https://arxiv.org/abs/2112.10752) | [BibTeX](#bibtex) |
|
|
|
<p align="center"> |
|
<img src=assets/results.gif /> |
|
</p> |
|
|
|
|
|
|
|
[**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)<br/> |
|
[Robin Rombach](https://github.com/rromb)\*, |
|
[Andreas Blattmann](https://github.com/ablattmann)\*, |
|
[Dominik Lorenz](https://github.com/qp-qp),
|
[Patrick Esser](https://github.com/pesser), |
|
[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/> |
|
\* equal contribution |
|
|
|
<p align="center"> |
|
<img src=assets/modelfigure.png /> |
|
</p> |
|
|
|
## News |
|
### April 2022 |
|
- More pre-trained LDMs are available: |
|
- A 1.45B [model](#text-to-image) trained on the [LAION-400M](https://arxiv.org/abs/2111.02114) database. |
|
- A class-conditional model on ImageNet, achieving an FID of 3.6 when using [classifier-free guidance](https://openreview.net/pdf?id=qw8AKxfYbI). Available via a [colab notebook](https://colab.research.google.com/github/CompVis/latent-diffusion/blob/main/scripts/latent_imagenet_diffusion.ipynb) [![][colab]][colab-cin].
|
|
|
## Requirements |
|
A suitable [conda](https://conda.io/) environment named `ldm` can be created |
|
and activated with: |
|
|
|
``` |
|
conda env create -f environment.yaml |
|
conda activate ldm |
|
``` |
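
As a quick sanity check of the environment (this one-liner is just an illustrative suggestion, not part of the repository), you can verify that PyTorch is importable and sees a GPU:

```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```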
|
|
|
# Pretrained Models |
|
A general list of all available checkpoints is available via our [model zoo](#model-zoo).
|
If you use any of these models in your work, we are always happy to receive a [citation](#bibtex). |
|
|
|
## Text-to-Image |
|
 |
|
|
|
|
|
Download the pre-trained weights (5.7GB) |
|
``` |
|
mkdir -p models/ldm/text2img-large/ |
|
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt |
|
``` |
|
and sample with |
|
``` |
|
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50 |
|
``` |
|
This will save each sample individually as well as a grid of size `n_iter` x `n_samples` at the specified output location (default: `outputs/txt2img-samples`). |
|
Quality, sampling speed and diversity are best controlled via the `scale`, `ddim_steps` and `ddim_eta` arguments. |
|
As a rule of thumb, higher values of `scale` produce better samples at the cost of a reduced output diversity. |
|
Furthermore, increasing `ddim_steps` generally also gives higher quality samples, but returns are diminishing for values > 250. |
|
Fast sampling (i.e. low values of `ddim_steps`) while retaining good quality can be achieved by using `--ddim_eta 0.0`. |
|
Faster sampling (i.e. even lower values of `ddim_steps`) while retaining good quality can be achieved by using `--ddim_eta 0.0` and `--plms` (added by Katherine Crowson, see [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778)). |
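
For example, a faster PLMS run could look like this (the prompt and the concrete values are only illustrative; all flags are the ones documented above):

```
python scripts/txt2img.py --prompt "a watercolor painting of a lighthouse at night" --plms --ddim_steps 25 --ddim_eta 0.0 --n_samples 2 --n_iter 2 --scale 5.0
```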
|
|
|
#### Beyond 256² |
|
|
|
For certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on can sometimes yield interesting results. To try it out, tune the `H` and `W` arguments (which will be integer-divided by 8 to obtain the corresponding latent size), e.g. run
|
|
|
``` |
|
python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0 |
|
``` |
|
to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting. |
|
|
|
The example below was generated using the above command. |
|
 |
|
|
|
|
|
|
|
## Inpainting |
|
 |
|
|
|
Download the pre-trained weights |
|
``` |
|
mkdir -p models/ldm/inpainting_big/
wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1
|
``` |
|
|
|
and sample with |
|
``` |
|
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results |
|
``` |
|
`indir` should contain images `*.png` and masks `<image_fname>_mask.png` like |
|
the examples provided in `data/inpainting_examples`. |
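
To run the script on your own data, mirror this naming convention in a custom directory; a minimal sketch (all folder and file names below are purely illustrative):

```
# each image <name>.png is paired with a binary mask <name>_mask.png
ls data/my_inpainting_inputs/
# photo_001.png  photo_001_mask.png  photo_002.png  photo_002_mask.png
python scripts/inpaint.py --indir data/my_inpainting_inputs/ --outdir outputs/my_inpainting_results
```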
|
|
|
## Class-Conditional ImageNet |
|
|
|
Available via a [notebook](scripts/latent_imagenet_diffusion.ipynb) [![][colab]][colab-cin]. |
|
 |
|
|
|
[colab]: <https://colab.research.google.com/assets/colab-badge.svg> |
|
[colab-cin]: <https://colab.research.google.com/github/CompVis/latent-diffusion/blob/main/scripts/latent_imagenet_diffusion.ipynb> |
|
|
|
|
|
## Unconditional Models |
|
|
|
We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via |
|
|
|
```shell script |
|
CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim steps> -e <#eta>
|
``` |
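
For example, after fetching the pretrained LDMs (see the [model zoo](#model-zoo) below), sampling from the CelebA-HQ model might look like this; the `celeba256` directory name is an assumption, so check `models/ldm/` for the exact `<model_spec>` on your machine:

```shell script
CUDA_VISIBLE_DEVICES=0 python scripts/sample_diffusion.py -r models/ldm/celeba256/model.ckpt -l outputs/celeba_samples -n 8 --batch_size 4 -c 200 -e 0.0
```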
|
|
|
# Train your own LDMs |
|
|
|
## Data preparation |
|
|
|
### Faces |
|
For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the [taming-transformers](https://github.com/CompVis/taming-transformers#celeba-hq) |
|
repository. |
|
|
|
### LSUN |
|
|
|
The LSUN datasets can be conveniently downloaded via the script available [here](https://github.com/fyu/lsun). |
|
We performed a custom split into training and validation images, and provide the corresponding filenames |
|
at [https://ommer-lab.com/files/lsun.zip](https://ommer-lab.com/files/lsun.zip). |
|
After downloading, extract them to `./data/lsun`. The beds/cats/churches subsets should |
|
also be placed/symlinked at `./data/lsun/bedrooms`/`./data/lsun/cats`/`./data/lsun/churches`, respectively. |
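
A minimal sketch of this layout (the `/path/to/lsun/...` locations are placeholders for wherever your extracted LSUN images live):

```
mkdir -p data/lsun
# filename lists for the custom train/validation split
wget https://ommer-lab.com/files/lsun.zip
unzip lsun.zip -d data/lsun
# place or symlink the image subsets
ln -s /path/to/lsun/bedrooms data/lsun/bedrooms
ln -s /path/to/lsun/cats data/lsun/cats
ln -s /path/to/lsun/churches data/lsun/churches
```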
|
|
|
### ImageNet |
|
The code will try to download (through [Academic |
|
Torrents](http://academictorrents.com/)) and prepare ImageNet the first time it |
|
is used. However, since ImageNet is quite large, this requires a lot of disk |
|
space and time. If you already have ImageNet on your disk, you can speed things |
|
up by putting the data into |
|
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/` (which defaults to |
|
`~/.cache/autoencoders/data/ILSVRC2012_{split}/data/`), where `{split}` is one |
|
of `train`/`validation`. It should have the following structure: |
|
|
|
``` |
|
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ |
|
├── n01440764 |
|
│ ├── n01440764_10026.JPEG |
|
│ ├── n01440764_10027.JPEG |
|
│ ├── ... |
|
├── n01443537 |
|
│ ├── n01443537_10007.JPEG |
|
│ ├── n01443537_10014.JPEG |
|
│ ├── ... |
|
├── ... |
|
``` |
|
|
|
If you haven't extracted the data, you can also place |
|
`ILSVRC2012_img_train.tar`/`ILSVRC2012_img_val.tar` (or symlinks to them) into |
|
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/` / |
|
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/`, which will then be |
|
extracted into the above structure without downloading it again. Note that this
|
will only happen if neither a folder |
|
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/` nor a file |
|
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready` exists. Remove them
|
if you want to force running the dataset preparation again. |
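
A sketch of the symlink variant, assuming the ImageNet tar files already exist somewhere on disk (`XDG_CACHE` is used exactly as in the paths above and falls back to `~/.cache`; the `/path/to/...` locations are placeholders):

```
XDG_CACHE="${XDG_CACHE:-$HOME/.cache}"
mkdir -p "${XDG_CACHE}/autoencoders/data/ILSVRC2012_train" "${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation"
ln -s /path/to/ILSVRC2012_img_train.tar "${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/"
ln -s /path/to/ILSVRC2012_img_val.tar "${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/"
```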
|
|
|
|
|
## Model Training |
|
|
|
Logs and checkpoints for trained models are saved to `logs/<START_DATE_AND_TIME>_<config_spec>`. |
|
|
|
### Training autoencoder models |
|
|
|
Configs for training a KL-regularized autoencoder on ImageNet are provided at `configs/autoencoder`. |
|
Training can be started by running |
|
``` |
|
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0, |
|
``` |
|
where `config_spec` is one of {`autoencoder_kl_8x8x64`(f=32, d=64), `autoencoder_kl_16x16x16`(f=16, d=16), |
|
`autoencoder_kl_32x32x4`(f=8, d=4), `autoencoder_kl_64x64x3`(f=4, d=3)}. |
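
For example, to train the f=4, d=3 KL-regularized autoencoder on a single GPU (the GPU id `0` is only illustrative):

```
CUDA_VISIBLE_DEVICES=0 python main.py --base configs/autoencoder/autoencoder_kl_64x64x3.yaml -t --gpus 0,
```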
|
|
|
For training VQ-regularized models, see the [taming-transformers](https://github.com/CompVis/taming-transformers) |
|
repository. |
|
|
|
### Training LDMs |
|
|
|
In ``configs/latent-diffusion/`` we provide configs for training LDMs on the LSUN, CelebA-HQ, FFHQ and ImageNet datasets.
|
Training can be started by running |
|
|
|
```shell script |
|
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0, |
|
``` |
|
|
|
where ``<config_spec>`` is one of {`celebahq-ldm-vq-4` (f=4, VQ-reg. autoencoder, spatial size 64x64x3), `ffhq-ldm-vq-4` (f=4, VQ-reg. autoencoder, spatial size 64x64x3),
`lsun_bedrooms-ldm-vq-4` (f=4, VQ-reg. autoencoder, spatial size 64x64x3),
`lsun_churches-ldm-kl-8` (f=8, KL-reg. autoencoder, spatial size 32x32x4), `cin-ldm-vq-8` (f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.
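
For example, to train the CelebA-HQ LDM on a single GPU (the GPU id `0` is only illustrative):

```shell script
CUDA_VISIBLE_DEVICES=0 python main.py --base configs/latent-diffusion/celebahq-ldm-vq-4.yaml -t --gpus 0,
```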
|
|
|
# Model Zoo |
|
|
|
## Pretrained Autoencoding Models |
|
 |
|
|
|
All models were trained until convergence (no further substantial improvement in rFID). |
|
|
|
| Model | rFID vs val | train steps |PSNR | PSIM | Link | Comments |
|
|-------------------------|------------|----------------|----------------|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------| |
|
| f=4, VQ (Z=8192, d=3) | 0.58 | 533066 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | | |
|
| f=4, VQ (Z=8192, d=3) | 1.06 | 658131 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention | |
|
| f=8, VQ (Z=16384, d=4) | 1.14 | 971043 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | | |
|
| f=8, VQ (Z=256, d=4) | 1.49 | 1608649 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
|
| f=16, VQ (Z=16384, d=8) | 5.15 | 1101166 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | | |
|
| | | | | | | | |
|
| f=4, KL | 0.27 | 176991 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | | |
|
| f=8, KL | 0.90 | 246803 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | | |
|
| f=16, KL (d=16) | 0.87 | 442998 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | | |
|
| f=32, KL (d=64) | 2.04 | 406763 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip | | |
|
|
|
### Get the models |
|
|
|
Running the following script downloads and extracts all available pretrained autoencoding models.
|
```shell script |
|
bash scripts/download_first_stages.sh |
|
``` |
|
|
|
The first stage models can then be found in `models/first_stage_models/<model_spec>`.
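
If you only need a single first-stage model, you can also fetch one of the zip files from the table above directly; a sketch for the f=8 KL autoencoder (the `kl-f8` directory name follows the `<model_spec>` convention and is an assumption):

```shell script
mkdir -p models/first_stage_models/kl-f8
wget -O models/first_stage_models/kl-f8/kl-f8.zip https://ommer-lab.com/files/latent-diffusion/kl-f8.zip
unzip -o models/first_stage_models/kl-f8/kl-f8.zip -d models/first_stage_models/kl-f8
```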
|
|
|
|
|
|
|
## Pretrained LDMs |
|
| Dataset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
|
|---------------------------------|------|--------------|---------------|-----------------|------|------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------| |
|
| CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0)| 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | | |
|
| FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1)| 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | | |
|
| LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0)| 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | | |
|
| LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1)| 2.95 (3.0) | 2.22 (2.23)| 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | | |
|
| ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77(7.76)* /15.82** | 201.56(209.52)* /78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10 **: w/o guiding, scores in bracket calculated with script provided by [ADM](https://github.com/openai/guided-diffusion) | |
|
| Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION | |
|
| OpenImages | Super-resolution | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation | |
|
| OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | | |
|
| Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip | | |
|
| Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | finetuned on resolution 512x512 | |
|
|
|
|
|
### Get the models |
|
|
|
The LDMs listed above can jointly be downloaded and extracted via |
|
|
|
```shell script |
|
bash scripts/download_models.sh |
|
``` |
|
|
|
The models can then be found in `models/ldm/<model_spec>`. |
|
|
|
|
|
|
|
## Coming Soon... |
|
|
|
* More inference scripts for conditional LDMs. |
|
* In the meantime, you can play with our [colab notebook](https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing).
|
|
|
## Comments |
|
|
|
- Our codebase for the diffusion models builds heavily on [OpenAI's ADM codebase](https://github.com/openai/guided-diffusion) |
|
and [https://github.com/lucidrains/denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch). |
|
Thanks for open-sourcing! |
|
|
|
- The implementation of the transformer encoder is from [x-transformers](https://github.com/lucidrains/x-transformers) by [lucidrains](https://github.com/lucidrains?tab=repositories). |
|
|
|
|
|
## BibTeX |
|
|
|
``` |
|
@misc{rombach2021highresolution, |
|
title={High-Resolution Image Synthesis with Latent Diffusion Models}, |
|
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer}, |
|
year={2021}, |
|
eprint={2112.10752}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV} |
|
} |
|
``` |
|
|
|
|
|
|