Add descriptive tags to the model card
#1
by nielsr (HF Staff) - opened

README.md CHANGED
````diff
@@ -1,18 +1,23 @@
 ---
-license: apache-2.0
-pipeline_tag: image-text-to-text
-library_name: transformers
 base_model:
 - OpenGVLab/InternVL3_5-2B-Pretrained
-base_model_relation: finetune
 datasets:
 - OpenGVLab/MMPR-v1.2
 - OpenGVLab/MMPR-Tiny
 language:
 - multilingual
+library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - internvl
 - custom_code
+- multimodal
+- vlm
+- reasoning
+- agent
+- qwen3
+base_model_relation: finetune
 ---
 
 # InternVL3_5-2B-Instruct
````
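The `tags` block is what the Hub indexes for search and filtering, so after this change the model surfaces under queries for `internvl`, `vlm`, `reasoning`, and the other labels above. A minimal sketch of checking such tags programmatically with `huggingface_hub` (the filter values and `limit` below are illustrative, not part of this model card):

```python
from huggingface_hub import HfApi

api = HfApi()

# List a few models that carry two of the tags added by this PR.
# `filter` matches Hub tags; each returned ModelInfo exposes its full tag list.
for model in api.list_models(filter=["internvl", "vlm"], limit=5):
    print(model.id, model.tags)
```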
````diff
@@ -27,7 +32,7 @@ tags:
 
 ## Introduction
 
-We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks
+We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
 
 ![image](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3.5/overall.jpg)
 
````
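The chat examples quoted in the hunks below assume that `model`, `tokenizer`, `generation_config`, and the `load_image`/`load_video` helpers have already been defined earlier in the README, outside the lines shown in this diff. A minimal loading sketch consistent with those examples, assuming the standard `transformers` remote-code path used by InternVL checkpoints (the dtype and device placement are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL3_5-2B-Instruct'

# The repository ships custom modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Matches the generation settings referenced in the hunk header below.
generation_config = dict(max_new_tokens=1024, do_sample=True)
```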
````diff
@@ -529,40 +534,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
 # pure-text conversation (纯文本对话)
 question = 'Hello, who are you?'
 response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 question = 'Can you tell me a story?'
 response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 # single-image single-round conversation (单图单轮对话)
 question = '<image>\nPlease describe the image shortly.'
 response = model.chat(tokenizer, pixel_values, question, generation_config)
 print(f'User: {question}\nAssistant: {response}')
 
 # single-image multi-round conversation (单图多轮对话)
 question = '<image>\nPlease describe the image in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 question = 'Please write a poem according to the image.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
 pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
 pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
 
 question = '<image>\nDescribe the two images in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 question = 'What are the similarities and differences between these two images.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
 pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -570,17 +585,20 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
 pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
 num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
 
 question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list,
                                history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 question = 'What are the similarities and differences between these two images.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list,
                                history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 # batch inference, single image per sample (单图批处理)
 pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -588,13 +606,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
 num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
 pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
 
 questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
 responses = model.batch_chat(tokenizer, pixel_values,
                              num_patches_list=num_patches_list,
                              questions=questions,
                              generation_config=generation_config)
 for question, response in zip(questions, responses):
     print(f'User: {question}\nAssistant: {response}')
 
 # video multi-round conversation (视频多轮对话)
 def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -632,17 +652,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
 video_path = './examples/red-panda.mp4'
 pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
 pixel_values = pixel_values.to(torch.bfloat16).cuda()
 video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
 question = video_prefix + 'What is the red panda doing?'
 # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
 question = 'Describe this video in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list, history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 ```
 
 #### Streaming Output
````
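The next hunk is taken from the README's LMDeploy section further down the file; it relies on a `pipe` object, a `load_image` helper, an `IMAGE_TOKEN` constant, and an `image_urls` list that are created just above the quoted lines and are not shown in this diff. A minimal setup sketch under that assumption; the engine configuration values are illustrative rather than taken from this model card:

```python
from lmdeploy import TurbomindEngineConfig, pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model_path = 'OpenGVLab/InternVL3_5-2B-Instruct'

# A single pipeline instance serves text-only, single-image, and multi-image prompts.
pipe = pipeline(model_path, backend_config=TurbomindEngineConfig(session_len=16384, tp=1))
```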
````diff
@@ -726,7 +753,9 @@ image_urls=[
 
 images = [load_image(img_url) for img_url in image_urls]
 # Numbering images improves multi-image conversations
 response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 print(response.text)
 ```
 
````
````diff
@@ -829,3 +858,14 @@ If you find this project useful in your research, please consider citing:
   year={2025}
 }
 ```
+
+
+## Acknowledgement
+
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+______________________________________________________________________
+
+Scan the following QR code to join our WeChat group.
+
+<p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>
````