---
license: apache-2.0
base_model:
  - DeepGlint-AI/rice-vit-large-patch14-560
  - Qwen/Qwen3-4B-Instruct-2507
---

# LLaVA-OneVision-1.5-8B Initialization Model Card

## 🚀 Overview

This model provides an initialization checkpoint for training LLaVA-OneVision-1.5, designed to combine strong language and vision capabilities. It integrates a powerful LLM and a state-of-the-art vision encoder, with a flexible adapter to enable efficient multimodal learning.

πŸ—οΈ Key Components

- **Vision Encoder:**
  The pretrained ViT from `DeepGlint-AI/rice-vit-large-patch14-560` extracts rich visual features.

- **Adapter:**
  A randomly initialized adapter module with 4× token compression, enabling efficient fusion of the image and text modalities (see the sketch after this list).

- **Language Model:**
  The pretrained `Qwen/Qwen3-4B-Instruct-2507` provides robust text understanding and generation.
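The adapter's exact architecture is defined in the training repository; the sketch below only illustrates one common way to achieve 4× token compression, merging each 2×2 block of ViT patch tokens and projecting the result into the LLM embedding space with an MLP. The class name, layer choices, and dimensions here are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class TokenCompressAdapter(nn.Module):
    """Illustrative adapter: merges each 2x2 block of vision tokens into one
    token (4x compression) and projects it into the LLM embedding space.
    Name, layers, and dimensions are assumptions, not the released design."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # Four neighbouring vision tokens are concatenated channel-wise,
        # so the projector input is 4 * vision_dim per merged token.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * 4, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, vision_dim), assuming a square patch grid
        b, n, c = x.shape
        hw = int(n ** 0.5)
        x = x.view(b, hw // 2, 2, hw // 2, 2, c)  # split grid into 2x2 blocks
        x = x.permute(0, 1, 3, 2, 4, 5)           # (b, h/2, w/2, 2, 2, c)
        x = x.reshape(b, (hw // 2) ** 2, 4 * c)   # flatten each block to 4*c
        return self.proj(x)


# Illustrative dimensions: a ViT-Large-style encoder (hidden size 1024) at
# 560px input with patch size 14 yields a 40x40 grid, i.e. 1600 tokens.
adapter = TokenCompressAdapter(vision_dim=1024, llm_dim=2560)
tokens = torch.randn(1, 1600, 1024)
print(adapter(tokens).shape)  # torch.Size([1, 400, 2560]) -- 4x fewer tokens
```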

πŸ“ Usage

This initialization checkpoint is intended for downstream training and fine-tuning. For usage and training scripts, please refer to the [EvolvingLMMs-Lab/LLaVA-OneVision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5) repository.
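If the checkpoint follows the standard Hugging Face layout, it could in principle be loaded and inspected as below; the repository id is a placeholder and `trust_remote_code` is an assumption about the packaging, so rely on the training repository's scripts for actual training runs.

```python
import torch
from transformers import AutoModel

# Hypothetical repo id: replace with this checkpoint's actual Hub id.
# trust_remote_code=True assumes the model ships custom modeling code.
model = AutoModel.from_pretrained(
    "your-org/LLaVA-OneVision-1.5-8B-Init",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```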

βš–οΈ License

Apache 2.0