AndesVL-4B-Thinking

AndesVL is a suite of mobile-optimized Multimodal Large Language Models (MLLMs) with 0.6B to 4B parameters, built on Qwen3 LLMs and a range of visual encoders. Designed for efficient edge deployment, it achieves first-tier performance on diverse benchmarks covering text-rich tasks, reasoning, Visual Question Answering (VQA), multi-image tasks, multilingual tasks, and GUI tasks. Its "1+N" LoRA architecture and QALFT framework enable efficient task adaptation and model compression, yielding a peak decoding speedup of 6.7x and compression down to 1.8 bits per weight on mobile chips.

Detailed model sizes and components are provided below:

| Model | Total Parameters (B) | Visual Encoder | LLM |
|-------|----------------------|----------------|-----|
| AndesVL-0.6B | 0.695 | SigLIP2-Base | Qwen3-0.6B |
| AndesVL-1B | 0.927 | AIMv2-Large | Qwen3-0.6B |
| AndesVL-2B | 2.055 | AIMv2-Large | Qwen3-1.7B |
| AndesVL-4B | 4.360 | AIMv2-Large | Qwen3-4B |
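
As a rough illustration of the "1+N" idea mentioned above (one shared base model plus N task-specific LoRA adapters), the sketch below wires up two adapters with the peft library. The adapter names, target module names, and LoRA hyperparameters are placeholders; this is a generic sketch, not the released QALFT tooling.

import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Shared base model (the "1"), loaded once.
base = AutoModel.from_pretrained(
    "OPPOer/AndesVL-4B-Thinking", trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Hypothetical task adapters (the "N"); target_modules depends on the actual layer names.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg, adapter_name="ocr")  # adapter for an OCR-style task
model.add_adapter("gui", lora_cfg)                          # adapter for a GUI task
model.set_adapter("gui")                                    # switch tasks without reloading the base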

Quick Start

# requires transformers>=4.52.4


import torch
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model_dir = "OPPOer/AndesVL-4B-Thinking"

model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(model_dir, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://i-blog.csdnimg.cn/blog_migrate/2f4c88e71f7eabe46d062d2f1ec77d10.jpeg"  # or replace with a local image path
                },
            },
        ],
    },
]
res = model.chat(messages, tokenizer, image_processor, max_new_tokens=1024, do_sample=True, temperature=0.6, Thinking=True)
print(res)
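
As a follow-up, here is a minimal variation of the call above. It assumes that a local file path is accepted in the image_url field (as the comment above suggests) and that Thinking=False selects the non-thinking response mode; both are assumptions rather than documented behavior.

# Variation: local image path and thinking mode disabled.
# Assumptions: the custom chat() helper resolves local paths passed via "url",
# and Thinking=False turns off the thinking trace.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "/path/to/local_image.jpg"}},  # hypothetical local path
        ],
    },
]
res = model.chat(messages, tokenizer, image_processor, max_new_tokens=512, do_sample=False, Thinking=False)
print(res)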

Citation

If you find our work helpful, please consider citing it.

@misc{jin2025andesvltechnicalreportefficient,
      title={AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model}, 
      author={{AndesVL Team, OPPO AI Center}},
      year={2025},
      eprint={2510.11496},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.11496}, 
}

Acknowledgements

We are very grateful to the Qwen, AIMv2, and SigLIP 2 projects for their efforts.
