---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
---

# Show-o2: Improved Native Unified Multimodal Models

This repository contains the Show-o2 models, which are improved native unified multimodal models as presented in the paper [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564).

Show-o2 leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. Unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. The resulting Show-o2 models handle a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos.

## Model Details

### Model Description

This model card describes **Show-o2**, an improved native unified multimodal model that leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. It enables scalable handling of text, image, and video modalities for both multimodal understanding and generation.

- **Developed by:** [Show Lab](https://sites.google.com/view/showlab) (National University of Singapore) and ByteDance
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** Unified Multimodal Model
- **Language(s) (NLP):** English (for text understanding and generation components)
- **License:** apache-2.0
- **Finetuned from model [optional]:** Uses [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) as a language model component.

### Model Sources

- **Repository:** [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2)
- **Paper:** [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/showlab/Show-o)

## Uses

### Direct Use

Show-o2 can be used directly for a wide range of multimodal tasks, including image captioning, visual question answering (VQA), text-to-image generation, text-guided image inpainting, and extrapolation.

### Downstream Use [optional]

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

First, set up the environment:

```bash
pip3 install -r requirements.txt
```

Log in to your wandb account on your machine or server:

```bash
wandb login
```

### Multimodal Understanding (MMU) Inference

```bash
python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
```

### Text-to-Image Generation Inference

```bash
python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'
```
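Here, `guidance_scale` sets the classifier-free guidance strength and `generation_timesteps` the number of flow-matching integration steps. As a rough illustration of what such a sampler typically does (the function and argument names below are hypothetical, not the repository's actual API), an Euler integrator with classifier-free guidance over a learned velocity field can be sketched as:

```python
import torch

@torch.no_grad()
def sample_flow_matching(velocity_model, cond, uncond, latent_shape,
                         guidance_scale=5.0, generation_timesteps=50, device="cuda"):
    """Hypothetical Euler sampler for a flow-matching generator with classifier-free guidance."""
    x = torch.randn(latent_shape, device=device)              # start from Gaussian noise at t = 0
    ts = torch.linspace(0.0, 1.0, generation_timesteps + 1, device=device)
    for i in range(generation_timesteps):
        t, t_next = ts[i], ts[i + 1]
        v_cond = velocity_model(x, t, cond)                   # velocity predicted with the text condition
        v_uncond = velocity_model(x, t, uncond)               # velocity predicted with a null condition
        v = v_uncond + guidance_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + (t_next - t) * v                              # Euler step along the flow from noise to data
    return x                                                  # final latent, decoded afterwards by the 3D causal VAE
```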
### Text-guided Inpainting Inference

```bash
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp
```

### Text-guided Extrapolation Inference

```bash
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
```

## Training Details

### Training Data

Show-o2 models are trained on a combination of datasets, including ImageNet-1K, large-scale image-text datasets, high-quality image-text datasets, and LLaVA datasets for instruction tuning.

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** A two-stage training recipe is designed to effectively learn and scale to larger models.

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

Show-o2 leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. Unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation.
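To make the two objectives concrete, the sketch below shows how a next-token-prediction loss on the language head and a flow-matching (velocity-regression) loss on the flow head are commonly formulated. All tensor names and the `flow_head` callable are assumptions for illustration; this is a sketch of the general technique, not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def unified_training_losses(language_logits, text_targets, flow_head, hidden_states, clean_latents):
    """Sketch of the two objectives: autoregressive text loss + flow-matching loss on VAE latents."""
    # Autoregressive objective on the language head: predict token i+1 from positions <= i.
    ar_loss = F.cross_entropy(
        language_logits[:, :-1].reshape(-1, language_logits.size(-1)),
        text_targets[:, 1:].reshape(-1),
    )

    # Flow-matching objective on the flow head: sample t ~ U(0, 1), linearly interpolate
    # between noise and clean latents, and regress the predicted velocity onto (data - noise).
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.size(0), device=clean_latents.device)
    t_expanded = t.view(-1, *([1] * (clean_latents.dim() - 1)))
    x_t = (1.0 - t_expanded) * noise + t_expanded * clean_latents
    target_velocity = clean_latents - noise
    predicted_velocity = flow_head(x_t, t, hidden_states)      # hypothetical flow-head call
    fm_loss = F.mse_loss(predicted_velocity, target_velocity)

    # The two losses are typically combined as a weighted sum during training.
    return ar_loss, fm_loss
```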
### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation

**BibTeX:**

```bibtex
@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2506.15564},
  year={2025}
}
```

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]