MM-Coder-7B is a multimodal model that processes both text and images and excels at generating code from UML diagrams and flowcharts. It is based on Qwen2-VL-7B and has been fine-tuned on the MMc-Instruct-Stage1 (link coming soon) and MMc-Instruct-Stage2 datasets.
Verified on:

(Note: newer versions of transformers may cause errors (see https://github.com/vllm-project/vllm/issues/15614); please use the versions listed above.)
Below, we provide a simple example showing inference with MM-Coder-7B using transformers. Our model is fully compatible with Qwen2-VL-7B-Instruct; for more usage patterns, refer to the Qwen2-VL-7B-Instruct documentation.
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Multilingual-Multimodal-NLP/MM-Coder-7B", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Multilingual-Multimodal-NLP/MM-Coder-7B")

# The default range for the number of visual tokens per image is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a
# token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256 * 28 * 28
# max_pixels = 1280 * 28 * 28
# processor = AutoProcessor.from_pretrained(
#     "Multilingual-Multimodal-NLP/MM-Coder-7B", min_pixels=min_pixels, max_pixels=max_pixels
# )
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "[IMAGE_PATH]",
            },
            {
                "type": "text",
                "text": "Use Python to complete the task as described in the diagram:\nDesign a Crop class in a virtual farm management system.",
            },
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# [OUTPUT]
# Here is a comprehensive solution for the Crop class based on the provided diagram:
# ```python
# class Crop:
#     def __init__(self, name, plant_date):
#         self.name = name
#         self.plant_date = plant_date
#         self.status = "Planted"
#
#     def grow(self):
#         if self.status == "Planted":
#             self.status = "Growing"
#         elif self.status == "Growing":
#             self.status = "Harvested"
#
#     def get_crop_infos(self):
#         return f"Crop(name={self.name}, status={self.status})"
# ...
```
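The model replies in markdown, with the generated program inside a fenced code block (as in the [OUTPUT] above). If you only want the code, a small post-processing step suffices; `extract_code` below is our own illustrative helper, not part of the model's API:

```python
import re

def extract_code(answer: str) -> str:
    """Pull the first fenced Python block out of a markdown answer."""
    match = re.search(r"`{3}python\n(.*?)`{3}", answer, re.DOTALL)
    # Fall back to the raw answer if no fenced block is present
    return match.group(1) if match else answer

print(extract_code(output_text[0]))
```

Because the model follows the Qwen2-VL interface, the same pipeline also extends to batched inference: pass a list of conversations to `process_vision_info` and a list of templated prompts to the processor. A minimal sketch, assuming the standard Qwen2-VL batching behavior (`messages2` is a hypothetical second conversation, structured like `messages` above):

```python
# Batched inference sketch; messages2 is a placeholder second conversation
messages2 = messages
batch = [messages, messages2]

texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in batch
]
image_inputs, video_inputs = process_vision_info(batch)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```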
If you find our work helpful, feel free to cite us.
```bibtex
@misc{mmcoder,
      title={Multilingual Multimodal Software Developer for Code Generation},
      author={Linzheng Chai and Jian Yang and Shukai Liu and Wei Zhang and Liran Wang and Ke Jin and Tao Sun and Congnan Liu and Chenchen Zhang and Hualei Zhu and Jiaheng Liu and Xianjie Wu and Ge Zhang and Tianyu Liu and Zhoujun Li},
      year={2025},
      eprint={2507.08719},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.08719},
}
```