---
library_name: diffusers
license: apache-2.0
pipeline_tag: any-to-any
---

# Show-o2: Improved Native Unified Multimodal Models

This repository contains the Show-o2 models, which are improved native unified multimodal models as presented in the paper [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564).

Show-o2 leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. Unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. The resulting Show-o2 models handle a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos.

## Model Details

### Model Description

This model card describes **Show-o2**, an improved native unified multimodal model that leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. It enables scalable handling of text, image, and video modalities for both multimodal understanding and generation.

- **Developed by:** [Show Lab](https://sites.google.com/view/showlab) (National University of Singapore) and ByteDance
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** Unified Multimodal Model
- **Language(s) (NLP):** English (for text understanding and generation components)
- **License:** apache-2.0
- **Finetuned from model [optional]:** Uses [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) as a language model component.

### Model Sources

- **Repository:** [https://github.com/showlab/Show-o/tree/main/show-o2](https://github.com/showlab/Show-o/tree/main/show-o2)
- **Paper:** [Show-o2: Improved Native Unified Multimodal Models](https://huggingface.co/papers/2506.15564)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/showlab/Show-o)

## Uses

### Direct Use

Show-o2 can be used directly for a wide range of multimodal tasks, including image captioning, visual question answering (VQA), text-to-image generation, text-guided image inpainting, and extrapolation.

### Downstream Use [optional]

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

First, set up the environment:

```bash
pip3 install -r requirements.txt
```

Log in to your wandb account on your machine or server:

```bash
wandb login
```

### Multimodal Understanding (MMU) Inference

```bash
python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
```

### Text-to-Image Generation Inference

```bash
python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'
```
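Here, `guidance_scale` sets the classifier-free guidance strength and `generation_timesteps` the number of flow-matching integration steps. As a rough illustration of what such a sampler typically does (the function and argument names below are hypothetical, not the repository's actual API), an Euler integrator with classifier-free guidance over a learned velocity field can be sketched as:

```python
import torch

@torch.no_grad()
def sample_flow_matching(velocity_model, cond, uncond, latent_shape,
                         guidance_scale=5.0, generation_timesteps=50, device="cuda"):
    """Hypothetical Euler sampler for a flow-matching generator with classifier-free guidance."""
    x = torch.randn(latent_shape, device=device)              # start from Gaussian noise at t = 0
    ts = torch.linspace(0.0, 1.0, generation_timesteps + 1, device=device)
    for i in range(generation_timesteps):
        t, t_next = ts[i], ts[i + 1]
        v_cond = velocity_model(x, t, cond)                   # velocity predicted with the text condition
        v_uncond = velocity_model(x, t, uncond)               # velocity predicted with a null condition
        v = v_uncond + guidance_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + (t_next - t) * v                              # Euler step along the flow from noise to data
    return x                                                  # final latent, decoded afterwards by the 3D causal VAE
```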
### Text-guided Inpainting Inference

```bash
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp
```

### Text-guided Extrapolation Inference

```bash
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
```

## Training Details

### Training Data

Show-o2 models are trained on a combination of datasets, including ImageNet-1K, large-scale image-text datasets, high-quality image-text datasets, and LLaVA datasets for instruction tuning.

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** A two-stage training recipe is designed to effectively learn and scale to larger models.

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

Show-o2 leverages autoregressive modeling and flow matching, built upon a 3D causal variational autoencoder space. Unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation.
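To make the two objectives concrete, the sketch below shows how a next-token-prediction loss on the language head and a flow-matching (velocity-regression) loss on the flow head are commonly formulated. All tensor names and the `flow_head` callable are assumptions for illustration; this is a sketch of the general technique, not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def unified_training_losses(language_logits, text_targets, flow_head, hidden_states, clean_latents):
    """Sketch of the two objectives: autoregressive text loss + flow-matching loss on VAE latents."""
    # Autoregressive objective on the language head: predict token i+1 from positions <= i.
    ar_loss = F.cross_entropy(
        language_logits[:, :-1].reshape(-1, language_logits.size(-1)),
        text_targets[:, 1:].reshape(-1),
    )

    # Flow-matching objective on the flow head: sample t ~ U(0, 1), linearly interpolate
    # between noise and clean latents, and regress the predicted velocity onto (data - noise).
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.size(0), device=clean_latents.device)
    t_expanded = t.view(-1, *([1] * (clean_latents.dim() - 1)))
    x_t = (1.0 - t_expanded) * noise + t_expanded * clean_latents
    target_velocity = clean_latents - noise
    predicted_velocity = flow_head(x_t, t, hidden_states)      # hypothetical flow-head call
    fm_loss = F.mse_loss(predicted_velocity, target_velocity)

    # The two losses are typically combined as a weighted sum during training.
    return ar_loss, fm_loss
```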
### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation

**BibTeX:**

```bibtex
@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2506.15564},
  year={2025}
}
```

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]