Model Overview
Description:
Qwen2.5-VL-7B-Surg-CholecT50 is a multimodal large language model fine-tuned on the CholecT50 dataset of laparoscopic cholecystectomy procedures to recognize and describe surgical actions, instruments, and targets in endoscopic video frames. The model was developed by NVIDIA for research in surgical workflow analysis and fine-grained action recognition.
This model is for research and development only.
License/Terms of Use
Please see the NSCLv1 license.
Deployment Geography:
Global
Use Case:
Primarily intended for surgical researchers, healthcare AI developers, or academic institutions exploring laparoscopic action recognition and surgical workflow analytics.
Reference(s):
Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., & Padoy, N. (2016). EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. arXiv:1602.03012.
Nwoye, C. I., & Padoy, N. (2022). Data Splits and Metrics for Benchmarking Methods on Surgical Action Triplet Datasets. arXiv:2204.05235.
Model Architecture:
Architecture Type: Transformer-based Large Language Model with a Vision Adapter
Network Architecture: Qwen2.5-VL-7B
This model was developed based on Qwen2.5-VL-7B.
**Number of model parameters:** ~7.0×10^9
Input:
Input Type(s): Image (endoscopic frame), (Optional) Text Prompt
Input Format: Red, Green, Blue (RGB), String
Input Parameters: Image: Two-Dimensional (2D) laparoscopic image frames (extracted at 1 fps), Text: One-Dimensional (1D)
Other Properties Related to Input: Recommended resolution: 480p or higher. Minimal resizing (e.g., 224×224) if required by the model's vision encoder. Token limit for text context: up to ~4k tokens.
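As a reference for preparing inputs, the sketch below extracts frames at 1 fps and resizes them, matching the properties above; the video path, output directory, and fixed 224×224 resize are illustrative assumptions, not part of this card.

```python
# Minimal frame-extraction sketch (paths and resize target are illustrative).
import os
import cv2  # pip install opencv-python

VIDEO = "video01.mp4"  # hypothetical input video
OUT_DIR = "frames"     # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(VIDEO)
native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if metadata is missing
step = max(1, int(round(native_fps)))           # keep one frame per second

idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        # Optional resize; only needed if your pipeline requires a fixed size.
        frame = cv2.resize(frame, (224, 224))
        cv2.imwrite(os.path.join(OUT_DIR, f"frame_{saved:06d}.png"), frame)
        saved += 1
    idx += 1
cap.release()
```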
Output:
Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Returns natural language descriptions of recognized instruments, actions, and targets; no bounding boxes or segmentation maps by default. Downstream systems may parse the text output for analytics. NVIDIA GPUs can significantly reduce inference time.
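Since the model returns free text, downstream parsing is left to the integrator. The sketch below shows one possible regex-based extraction of an <instrument, verb, target> triplet; the assumed output phrasing is an illustrative assumption, as this card does not specify a fixed output format.

```python
# Hypothetical triplet parser; the expected phrasing is an assumption,
# since the card does not define a fixed output schema.
import re
from typing import Optional, Tuple

TRIPLET_RE = re.compile(
    r"instrument:\s*(?P<instrument>[\w\- ]+)[,;]\s*"
    r"(?:verb|action):\s*(?P<verb>[\w\- ]+)[,;]\s*"
    r"target:\s*(?P<target>[\w\- ]+)",
    re.IGNORECASE,
)

def parse_triplet(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (instrument, verb, target) if the pattern matches, else None."""
    m = TRIPLET_RE.search(text)
    if m is None:
        return None
    return tuple(s.strip().lower() for s in m.group("instrument", "verb", "target"))

# Example with a made-up model response:
print(parse_triplet("Instrument: grasper; Verb: retract; Target: gallbladder"))
# -> ('grasper', 'retract', 'gallbladder')
```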
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s): Any standard LLM-serving solution (e.g., PyTorch with Triton Inference Server); a minimal loading sketch appears at the end of this section.
**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere (e.g., A100)
- NVIDIA Hopper (e.g., H100)
Preferred/Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
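As one concrete runtime path, the sketch below loads the checkpoint with the Hugging Face transformers Qwen2.5-VL classes (the model is based on Qwen2.5-VL-7B, so the upstream interface is assumed to apply) and runs single-frame inference; the frame path and prompt wording are illustrative assumptions.

```python
# Minimal PyTorch/transformers inference sketch (frame path and prompt are illustrative).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "nvidia/Qwen2.5-VL-7B-Surg-CholecT50"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "frames/frame_000123.png"},  # hypothetical frame
        {"type": "text", "text": "Describe the instrument, action, and target in this frame."},
    ],
}]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the generated answer is decoded.
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```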
Model Version(s):
v1.0 (Finetuned on CholecT50)
This model may be used with the MONAI Surgical Agent Framework.
Training Dataset:
**Data Modality:**
- Image and Text
**Image Training Data Size:**
- Less than a Million Images
**Text Training Data Size:**
- Less than a Billion Tokens
**Data Collection Method by dataset:**
- Hybrid: Automated, Human
**Labeling Method by dataset:**
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): 50 laparoscopic cholecystectomy procedures; frames extracted at 1 fps (100K training frames); annotations include <instrument, verb, target> triplets.
Testing Dataset:
Link: CholecT50 (holdout portion)
Data Collection Method by dataset:
- Hybrid: Automated, Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): ~1–2K frames for testing.
Evaluation Dataset:
Link: CholecT50 (dedicated set never seen during training)
**Benchmark Score:**
- F1-score (Triplets): Instrument: 0.81, Verb: 0.64, Target (Anatomy): 0.60 (see the evaluation sketch at the end of this section)
Data Collection Method by dataset:
- Hybrid: Automated, Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): ~1–2K frames for final evaluation.
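For context on the scores above, per-component F1 over parsed triplets can be computed as in the sketch below; the label lists are made-up examples, and macro averaging is an assumption, since this card does not state the averaging scheme used.

```python
# Hypothetical per-component F1 computation over parsed triplets.
# Labels are made-up; macro averaging is an assumption (the card does not
# state how the reported F1 scores were averaged).
from sklearn.metrics import f1_score

gt_instruments   = ["grasper", "hook", "grasper", "clipper"]
pred_instruments = ["grasper", "hook", "hook",    "clipper"]

gt_verbs   = ["retract", "dissect", "retract", "clip"]
pred_verbs = ["retract", "dissect", "grasp",   "clip"]

gt_targets   = ["gallbladder", "cystic_duct", "gallbladder", "cystic_artery"]
pred_targets = ["gallbladder", "cystic_duct", "liver",       "cystic_artery"]

for name, gt, pred in [
    ("Instrument", gt_instruments, pred_instruments),
    ("Verb",       gt_verbs,       pred_verbs),
    ("Target",     gt_targets,     pred_targets),
]:
    print(f"{name}: F1 = {f1_score(gt, pred, average='macro'):.2f}")
```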
Inference:
Acceleration Engine: vLLM
Test Hardware: A6000
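A minimal client-side sketch against a vLLM OpenAI-compatible endpoint is shown below; the server command in the comment, the endpoint URL, the frame path, and the prompt wording are illustrative assumptions.

```python
# Client sketch assuming an OpenAI-compatible vLLM server, e.g. started with:
#   vllm serve nvidia/Qwen2.5-VL-7B-Surg-CholecT50
# Endpoint URL, frame path, and prompt wording are illustrative assumptions.
import base64
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("frames/frame_000123.png", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="nvidia/Qwen2.5-VL-7B-Surg-CholecT50",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
            {"type": "text",
             "text": "Describe the instrument, action, and target in this frame."},
        ],
    }],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```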
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, note that the model will not blur or otherwise de-identify the subjects it depicts. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.