---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
---
# Show-o2: Improved Native Unified Multimodal Models
This repository contains the Show-o2 models, which are improved native unified multimodal models as presented in the paper [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564).
Show-o2 leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. Unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Building on a language model, Show-o2 natively applies autoregressive modeling to the language head and flow matching to the flow head, handling text token prediction and image/video generation, respectively.
The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos.
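To give a concrete, purely illustrative picture of the dual-path design described above, the sketch below fuses an understanding-oriented path and a generation-oriented path over 3D causal VAE latents into one sequence of unified visual tokens. All module names, dimensions, and the concrete fusion operator are assumptions for illustration and do not correspond to the released implementation.

```python
import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    """Toy sketch of dual-path spatial(-temporal) fusion (illustrative only)."""

    def __init__(self, latent_dim: int = 16, hidden_dim: int = 256):
        super().__init__()
        self.semantic_path = nn.Linear(latent_dim, hidden_dim)   # understanding-oriented features
        self.projector_path = nn.Linear(latent_dim, hidden_dim)  # generation-oriented features
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)        # spatial(-temporal) fusion

    def forward(self, vae_latents: torch.Tensor) -> torch.Tensor:
        # vae_latents: (batch, tokens, latent_dim), flattened from a (T, H, W)
        # latent grid produced by the 3D causal VAE.
        sem = self.semantic_path(vae_latents)
        proj = self.projector_path(vae_latents)
        return self.fuse(torch.cat([sem, proj], dim=-1))  # unified visual tokens for the LLM

fusion = DualPathFusion()
tokens = fusion(torch.randn(1, 256, 16))  # e.g. a 16x16 latent grid from a single image
print(tokens.shape)  # torch.Size([1, 256, 256])
```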
## Model Details
### Model Description
This model card describes **Show-o2**, an improved native unified multimodal model that leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. It enables scalable handling of text, image, and video modalities for both multimodal understanding and generation.
- **Developed by:** [Show Lab](https://sites.google.com/view/showlab) (National University of Singapore) and ByteDance
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** Unified Multimodal Model
- **Language(s) (NLP):** English (for text understanding and generation components)
- **License:** apache-2.0
- **Finetuned from model [optional]:** Uses [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) as a language model component.
### Model Sources
- **Repository:** [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2)
- **Paper:** [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/showlab/Show-o)
## Uses
### Direct Use
Show-o2 can be directly used for a wide range of multimodal tasks including image captioning, visual question answering (VQA), text-to-image generation, text-guided image inpainting, and extrapolation.
### Downstream Use [optional]
[More Information Needed]
### Out-of-Scope Use
[More Information Needed]
## Bias, Risks, and Limitations
[More Information Needed]
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
First, set up the environment:
```bash
pip3 install -r requirements.txt
```
Log in to your wandb account on your machine or server:
```bash
wandb login
```
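Before running inference, you may want to pull the pretrained weights from the Hugging Face Hub. A minimal sketch with `huggingface_hub` is shown below; the repository ID is a placeholder that you should replace with the actual Show-o2 checkpoint repository.

```python
from huggingface_hub import snapshot_download

# NOTE: "showlab/show-o2" is a placeholder repo ID -- substitute the actual
# Show-o2 checkpoint repository you intend to use.
local_dir = snapshot_download(repo_id="showlab/show-o2", local_dir="./checkpoints/show-o2")
print(f"Checkpoint downloaded to {local_dir}")
```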
### Multimodal Understanding (MMU) Inference
```bash
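# Multiple questions about the same image can be asked in one command by
# separating them with ' *** ' in the question string.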
python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
```
### Text-to-Image Generation Inference
```bash
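# Prompts are read from the validation prompts file; guidance_scale sets the
# classifier-free guidance strength and generation_timesteps the number of sampling steps.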
python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'
```
### Text-guided Inpainting Inference
```bash
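# The region marked by the mask image is regenerated so that it matches the text prompt.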
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp
```
### Text-guided Extrapolation Inference
```bash
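# Each ' *** '-separated direction (with its matching prompt) adds one extrapolation step;
# this example extends the image three times to the left and three times to the right.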
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
```
## Training Details
### Training Data
Show-o2 models are trained on a combination of datasets including ImageNet-1K, large-scale Image-Text datasets, high-quality Image-Text datasets, and LLaVA datasets for instruction tuning.
### Training Procedure
#### Preprocessing [optional]
[More Information Needed]
#### Training Hyperparameters
- **Training regime:** A two-stage training recipe is used to learn effectively and to scale to larger models.
#### Speeds, Sizes, Times [optional]
[More Information Needed]
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
[More Information Needed]
#### Factors
[More Information Needed]
#### Metrics
[More Information Needed]
### Results
[More Information Needed]
#### Summary
[More Information Needed]
## Model Examination [optional]
[More Information Needed]
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications [optional]
### Model Architecture and Objective
Show-o2 leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. Unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities. Building on a language model, Show-o2 natively applies autoregressive modeling to the language head for text token prediction and flow matching to the flow head for image/video generation.
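As a rough illustration of the two objectives described above, the following is a minimal sketch (under assumed module names, shapes, and loss weighting) of how a next-token-prediction loss on the language head could be combined with a flow-matching loss on the flow head; it is not the released training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for the two heads (assumptions, not the released modules).
vocab_size, hidden, latent = 1000, 64, 16
language_head = nn.Linear(hidden, vocab_size)          # predicts the next text token
flow_head = nn.Linear(hidden + latent + 1, latent)     # predicts the flow-matching velocity

def combined_loss(text_hidden, visual_hidden, text_targets, clean_latents, fm_weight=1.0):
    # Autoregressive next-token prediction on text positions (cross-entropy).
    logits = language_head(text_hidden)                              # (B, T, vocab)
    ntp_loss = F.cross_entropy(logits.transpose(1, 2), text_targets)

    # Flow matching on visual positions: interpolate noise -> clean latents
    # and regress the straight-path velocity v = x_1 - x_0.
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.size(0), 1, 1, device=clean_latents.device)
    x_t = (1 - t) * noise + t * clean_latents
    inp = torch.cat([visual_hidden, x_t, t.expand(-1, x_t.size(1), 1)], dim=-1)
    fm_loss = F.mse_loss(flow_head(inp), clean_latents - noise)

    return ntp_loss + fm_weight * fm_loss

# Toy usage with random tensors.
loss = combined_loss(
    text_hidden=torch.randn(2, 8, hidden),
    visual_hidden=torch.randn(2, 32, hidden),
    text_targets=torch.randint(0, vocab_size, (2, 8)),
    clean_latents=torch.randn(2, 32, latent),
)
print(loss.item())
```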
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation
**BibTeX:**
```bibtex
@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2506.15564},
  year={2025}
}
```
**APA:**
Xie, J., Yang, Z., & Shou, M. Z. (2025). *Show-o2: Improved native unified multimodal models*. arXiv preprint arXiv:2506.15564.
## Glossary [optional]
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]