license: apache-2.0
base_model:
  - DeepGlint-AI/rice-vit-large-patch14-560
  - Qwen/Qwen3-4B-Instruct-2507
LLaVA-OneVision-1.5-8B Initialization Model Card
Overview
This checkpoint initializes training for LLaVA-OneVision-1.5. It pairs a pretrained language model with a pretrained vision encoder and connects them through an adapter, so multimodal training starts from strong language and vision weights.
Key Components
- Vision Encoder: 
 Uses the pretrained ViT model from DeepGlint-AI/rice-vit-large-patch14-560 to extract rich visual features.
- Adapter: 
 A randomly initialized adapter module with 4× token compression capability, enabling efficient fusion of image and text modalities.
- Language Model: 
 Incorporates the pretrained language model Qwen/Qwen3-4B-Instruct-2507 for robust text understanding and generation.
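To give a feel for what 4× token compression means in practice, here is a rough sketch of the arithmetic. It rests on assumptions that this card does not state: that `patch14-560` in the encoder name means 14-pixel patches on a 560×560 input, that any class token is ignored, and that the compression merges each 2×2 block of neighbouring patch tokens. The helper names `patch_grid` and `merge_2x2` are made up for illustration and are not part of the LLaVA-OneVision-1.5 code.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def merge_2x2(grid):
    """Group each 2x2 block of neighbouring patch tokens into one merged token.

    `grid` is a list of rows, each row a list of tokens (any Python objects).
    Each merged token is a tuple of its four source tokens. A real adapter
    would instead concatenate the four feature vectors and project them;
    the 2x2 grouping itself is an assumption, not taken from the model card.
    """
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("grid height and width must be even")
    return [
        [
            (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


# Assumed reading of "patch14-560": 560x560 input, 14x14 patches.
side = patch_grid(560, 14)
tokens = [[(r, c) for c in range(side)] for r in range(side)]
merged = merge_2x2(tokens)

tokens_before = side * side
tokens_after = len(merged) * len(merged[0])
print(tokens_before, tokens_after, tokens_before // tokens_after)
```

Under these assumptions, the language model would see a quarter as many visual tokens per image as the vision encoder produces, which is where the efficiency gain comes from.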
Usage
This initialization checkpoint is intended for downstream training and fine-tuning. For usage and training scripts, please refer to the EvolvingLMMs-Lab/LLaVA-OneVision-1.5 repository.
References
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
- EvolvingLMMs-Lab/LLaVA-OneVision-1.5
Citation
If you find LLaVA-OneVision-1.5 useful in your research, please consider citing the following paper:
@misc{an2025llavaonevision15fullyopenframework,
      title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}, 
      author={Xiang An and Yin Xie and Kaicheng Yang and Wenkang Zhang and Xiuwei Zhao and Zheng Cheng and Yirui Wang and Songcen Xu and Changrui Chen and Chunsheng Wu and Huajie Tan and Chunyuan Li and Jing Yang and Jie Yu and Xiyao Wang and Bin Qin and Yumeng Wang and Zizhen Yan and Ziyong Feng and Ziwei Liu and Bo Li and Jiankang Deng},
      year={2025},
      eprint={2509.23661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.23661}, 
}
License
Apache 2.0

