---
license: apache-2.0
base_model:
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
---
# LLaVA-OneVision-1.5-8B Initialization Model Card

## Overview
This model provides an initialization checkpoint for training LLaVA-OneVision-1.5, designed to combine strong language and vision capabilities. It integrates a powerful LLM and a state-of-the-art vision encoder, with a flexible adapter to enable efficient multimodal learning.
## Key Components

- **Vision Encoder:** the pretrained ViT from DeepGlint-AI/rice-vit-large-patch14-560, used to extract rich visual features.
- **Adapter:** a randomly initialized adapter module with 4× token compression, enabling efficient fusion of the image and text modalities.
- **Language Model:** the pretrained Qwen/Qwen3-4B-Instruct-2507, providing robust text understanding and generation.
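The 4× token compression in the adapter can be illustrated with a small sketch: merge each 2×2 block of neighboring vision tokens into one token, then project it into the LLM's embedding space with a randomly initialized linear layer. This is a minimal numpy illustration, not the actual adapter implementation; the hidden sizes (1024 for the ViT, 2560 for the LLM) and the 2×2 merging scheme are assumptions.

```python
import numpy as np

def compress_tokens_2x2(features, d_model=2560, seed=0):
    """Merge each 2x2 block of vision tokens into one token (4x compression),
    then project into the LLM embedding space.

    features: (H, W, C) grid of patch tokens from the vision encoder.
    Hypothetical sketch; the real adapter may use a different scheme.
    """
    H, W, C = features.shape
    assert H % 2 == 0 and W % 2 == 0, "grid must be divisible by 2"
    # Group 2x2 neighborhoods: (H, W, C) -> (H/2, W/2, 4C)
    merged = features.reshape(H // 2, 2, W // 2, 2, C)
    merged = merged.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    # Randomly initialized projection, as in this init checkpoint's adapter
    rng = np.random.default_rng(seed)
    w_proj = rng.standard_normal((4 * C, d_model)) * 0.02
    return merged.reshape(-1, 4 * C) @ w_proj  # (H*W/4, d_model)

# A 560px image with 14px patches gives a 40x40 token grid
tokens = np.zeros((40, 40, 1024))
out = compress_tokens_2x2(tokens)
print(out.shape)  # (400, 2560): 1600 tokens compressed to 400
```

The compression matters because the LLM's context is the bottleneck: reducing 1600 visual tokens to 400 cuts the multimodal sequence length substantially while each merged token still carries the information of its 2×2 neighborhood.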
## Usage
This initialization checkpoint is intended for downstream training and fine-tuning. For usage and training scripts, please refer to the EvolvingLMMs-Lab/LLaVA-OneVision-1.5 repository.
## References
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
- EvolvingLMMs-Lab/LLaVA-OneVision-1.5
## License
Apache 2.0