Add descriptive tags to the model card

#1
by nielsr HF Staff - opened
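
In short, this change reorders the existing metadata fields and adds the descriptive tags `multimodal`, `vlm`, `reasoning`, `agent`, and `qwen3` alongside the existing `internvl` and `custom_code` tags; `license`, `pipeline_tag`, `library_name`, and `base_model_relation` keep their current values. Assembled from the diff below, the proposed README.md front matter reads:

```yaml
---
base_model:
- OpenGVLab/InternVL3_5-2B-Pretrained
datasets:
- OpenGVLab/MMPR-v1.2
- OpenGVLab/MMPR-Tiny
language:
- multilingual
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- internvl
- custom_code
- multimodal
- vlm
- reasoning
- agent
- qwen3
base_model_relation: finetune
---
```
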
Files changed (1)
  1. README.md +73 -33
README.md CHANGED
@@ -1,18 +1,23 @@
  ---
- license: apache-2.0
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL3_5-2B-Pretrained
- base_model_relation: finetune
+ - OpenGVLab/InternVL3_5-2B-Pretrained
  datasets:
- - OpenGVLab/MMPR-v1.2
- - OpenGVLab/MMPR-Tiny
+ - OpenGVLab/MMPR-v1.2
+ - OpenGVLab/MMPR-Tiny
  language:
- - multilingual
+ - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
  tags:
- - internvl
- - custom_code
+ - internvl
+ - custom_code
+ - multimodal
+ - vlm
+ - reasoning
+ - agent
+ - qwen3
+ base_model_relation: finetune
  ---

  # InternVL3_5-2B-Instruct
@@ -27,7 +32,7 @@ tags:

  ## Introduction

- We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksnarrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
+ We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)

@@ -529,40 +534,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # single-image single-round conversation (单图单轮对话)
- question = '<image>\nPlease describe the image shortly.'
+ question = '<image>
+ Please describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
- question = '<image>\nPlease describe the image in detail.'
+ question = '<image>
+ Please describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- question = '<image>\nDescribe the two images in detail.'
+ question = '<image>
+ Describe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -570,17 +585,20 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
+ question = 'Image-1: <image>
+ Image-2: <image>
+ Describe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list,
  history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
- num_patches_list=num_patches_list,
- history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ num_patches_list=num_patches_list, history=history, return_history=True)
+ print(f'User: {question}
+ Assistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -588,13 +606,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
+ questions = ['<image>
+ Describe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
  num_patches_list=num_patches_list,
  questions=questions,
  generation_config=generation_config)
  for question, response in zip(questions, responses):
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -632,17 +652,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
+ video_prefix = ''.join([f'Frame{i+1}: <image>
+ ' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
+ # Frame1: <image>
+ Frame2: <image>
+ ...
+ Frame8: <image>
+ {question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')
  ```

  #### Streaming Output
@@ -726,7 +753,9 @@ image_urls=[

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
+ response = pipe((f'Image-1: {IMAGE_TOKEN}
+ Image-2: {IMAGE_TOKEN}
+ describe these two images', images))
  print(response.text)
  ```

@@ -829,3 +858,14 @@ If you find this project useful in your research, please consider citing:
  year={2025}
  }
  ```
+
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+ ______________________________________________________________________
+
+ Scan the following QR Code, join our WeChat group.
+
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>