---
license: apache-2.0
base_model:
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
---
# LLaVA-OneVision-1.5-8B Initialization Model Card

## Overview
This model provides an initialization checkpoint for training LLaVA-OneVision-1.5, designed to combine strong language and vision capabilities. It integrates a powerful LLM and a state-of-the-art vision encoder, with a flexible adapter to enable efficient multimodal learning.
## Key Components

- **Vision Encoder:** the pretrained ViT from DeepGlint-AI/rice-vit-large-patch14-560, used to extract rich visual features.
- **Adapter:** a randomly initialized adapter module with 4× token compression, enabling efficient fusion of the image and text modalities.
- **Language Model:** the pretrained Qwen/Qwen3-4B-Instruct-2507, providing robust text understanding and generation.
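The 4× token compression in the adapter can be illustrated with a small sketch: merge each 2×2 block of neighboring vision tokens into one token, then project it into the LLM's embedding space with a randomly initialized linear layer. This is a minimal numpy illustration, not the actual adapter implementation; the hidden sizes (1024 for the ViT, 2560 for the LLM) and the 2×2 merging scheme are assumptions.

```python
import numpy as np

def compress_tokens_2x2(features, d_model=2560, seed=0):
    """Merge each 2x2 block of vision tokens into one token (4x compression),
    then project into the LLM embedding space.

    features: (H, W, C) grid of patch tokens from the vision encoder.
    Hypothetical sketch; the real adapter may use a different scheme.
    """
    H, W, C = features.shape
    assert H % 2 == 0 and W % 2 == 0, "grid must be divisible by 2"
    # Group 2x2 neighborhoods: (H, W, C) -> (H/2, W/2, 4C)
    merged = features.reshape(H // 2, 2, W // 2, 2, C)
    merged = merged.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    # Randomly initialized projection, as in this init checkpoint's adapter
    rng = np.random.default_rng(seed)
    w_proj = rng.standard_normal((4 * C, d_model)) * 0.02
    return merged.reshape(-1, 4 * C) @ w_proj  # (H*W/4, d_model)

# A 560px image with 14px patches gives a 40x40 token grid
tokens = np.zeros((40, 40, 1024))
out = compress_tokens_2x2(tokens)
print(out.shape)  # (400, 2560): 1600 tokens compressed to 400
```

The compression matters because the LLM's context is the bottleneck: reducing 1600 visual tokens to 400 cuts the multimodal sequence length substantially while each merged token still carries the information of its 2×2 neighborhood.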
## Usage
This initialization checkpoint is intended for downstream training and fine-tuning. For usage and training scripts, please refer to the EvolvingLMMs-Lab/LLaVA-OneVision-1.5 repository.
## References
- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
- EvolvingLMMs-Lab/LLaVA-OneVision-1.5
## License
Apache 2.0