---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# MM-Coder-7B (based on Qwen2-VL-7B)


## Introduction

MM-Coder-7B is a multimodal model that processes both text and images and specializes in generating code from UML diagrams and flowcharts. It is based on Qwen2-VL-7B and has been fine-tuned on the MMc-Instruct-Stage1 (coming soon) and [MMc-Instruct-Stage2](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/MMc-Instruct-Stage2) datasets.


## Requirements
Verified on:
- vllm==0.9.1
- transformers==4.49.0
- qwen-vl-utils==0.0.11
- accelerate==1.9.0

(Note: newer versions of transformers may cause errors (see https://github.com/vllm-project/vllm/issues/15614); please use the versions listed above.)
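
You can install the verified versions with pip (the pins below are taken directly from the list above):

```bash
pip install vllm==0.9.1 transformers==4.49.0 qwen-vl-utils==0.0.11 accelerate==1.9.0
```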


## Quickstart

Below, we provide a simple example showing inference with MM-Coder-7B using transformers. Our model is fully compatible with the Qwen2-VL-7B-Instruct interface; for more usage options, refer to
[Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Multilingual-Multimodal-NLP/MM-Coder-7B", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Multilingual-Multimodal-NLP/MM-Coder-7B")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Multilingual-Multimodal-NLP/MM-Coder-7B", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "[IMAGE_PATH]",
            },
            {"type": "text", "text": "Use Python to complete the task as described in the diagram:\nDesign a Crop class in a virtual farm management system."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)


# [OUTPUT]
# Here is a comprehensive solution for the Crop class based on the provided diagram:

# ```python
# class Crop:
#     def __init__(self, name, plant_date):
#         self.name = name
#         self.plant_date = plant_date
#         self.status = "Planted"

#     def grow(self):
#         if self.status == "Planted":
#             self.status = "Growing"
#         elif self.status == "Growing":
#             self.status = "Harvested"

#     def get_crop_infos(self):
#         return f"Crop(name={self.name}, status={self.status})"

# ...
```
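
Since the requirements above pin vllm==0.9.1, you can also run the model with vLLM for faster inference. The following is a minimal offline-inference sketch, not an official recipe: it assumes a local diagram image at the `[IMAGE_PATH]` placeholder and uses vLLM's standard multimodal API together with the HF processor's chat template.

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL = "Multilingual-Multimodal-NLP/MM-Coder-7B"

# Build the chat prompt (with the image placeholder tokens) via the HF processor.
processor = AutoProcessor.from_pretrained(MODEL)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Use Python to complete the task as described in the diagram:\nDesign a Crop class in a virtual farm management system."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# One image per prompt; raise limit_mm_per_prompt for multi-image inputs.
llm = LLM(model=MODEL, limit_mm_per_prompt={"image": 1})

image = Image.open("[IMAGE_PATH]")  # hypothetical path to your UML/flowchart image
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```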





## Citation

If you find our work helpful, feel free to cite our paper.

```bibtex
@misc{mmcoder,
      title={Multilingual Multimodal Software Developer for Code Generation}, 
      author={Linzheng Chai and Jian Yang and Shukai Liu and Wei Zhang and Liran Wang and Ke Jin and Tao Sun and Congnan Liu and Chenchen Zhang and Hualei Zhu and Jiaheng Liu and Xianjie Wu and Ge Zhang and Tianyu Liu and Zhoujun Li},
      year={2025},
      eprint={2507.08719},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.08719}, 
}

```