Model Overview
Description:
Qwen2.5-VL-7B-Surg-CholecT50 is a multimodal large language model fine-tuned on the CholecT50 dataset of laparoscopic cholecystectomy procedures to recognize and describe surgical actions, instruments, and targets in endoscopic video frames. The model was developed by NVIDIA for research in surgical workflow analysis and fine-grained action recognition.
This model is for research and development only.
License/Terms of Use
Please see the NSCLv1 license.
Deployment Geography:
Global
Use Case:
Primarily intended for surgical researchers, healthcare AI developers, or academic institutions exploring laparoscopic action recognition and surgical workflow analytics.
Reference(s):
Twinanda, A. P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., & Padoy, N. (2016). EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. arXiv:1602.03012.
Nwoye, C. I., & Padoy, N. (2022). Data Splits and Metrics for Benchmarking Methods on Surgical Action Triplet Datasets. arXiv:2204.05235.
Model Architecture:
Architecture Type: Transformer-based Large Language Model with a Vision Adapter
Network Architecture: Qwen2.5-VL-7B
This model was developed based on Qwen2.5-VL-7B.
**Number of model parameters:** ~7.0×10^9
Input:
Input Type(s): Image (endoscopic frame), (Optional) Text Prompt
Input Format: Red, Green, Blue (RGB), String
Input Parameters: Image: Two-Dimensional (2D) laparoscopic image frames (extracted at 1 fps), Text: One-Dimensional (1D)
Other Properties Related to Input: Recommended resolution: 480p or higher. Minimal resizing (e.g., 224×224) if required by the model's vision encoder. Token limit for text context: up to ~4k tokens.
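As a reference for preparing inputs, the sketch below extracts frames at 1 fps and resizes them, matching the properties above; the video path, output directory, and fixed 224×224 resize are illustrative assumptions, not part of this card.

```python
# Minimal frame-extraction sketch (paths and resize target are illustrative).
import os
import cv2  # pip install opencv-python

VIDEO = "video01.mp4"  # hypothetical input video
OUT_DIR = "frames"     # hypothetical output directory
os.makedirs(OUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(VIDEO)
native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if metadata is missing
step = max(1, int(round(native_fps)))           # keep one frame per second

idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        # Optional resize; only needed if your pipeline requires a fixed size.
        frame = cv2.resize(frame, (224, 224))
        cv2.imwrite(os.path.join(OUT_DIR, f"frame_{saved:06d}.png"), frame)
        saved += 1
    idx += 1
cap.release()
```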
Output:
Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Returns natural language descriptions of recognized instruments, actions, and targets; no bounding boxes or segmentation maps by default. Downstream systems may parse the text output for analytics. NVIDIA GPUs can significantly reduce inference time.
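Since the model returns free text, downstream parsing is left to the integrator. The sketch below shows one possible regex-based extraction of an <instrument, verb, target> triplet; the assumed output phrasing is an illustrative assumption, as this card does not specify a fixed output format.

```python
# Hypothetical triplet parser; the expected phrasing is an assumption,
# since the card does not define a fixed output schema.
import re
from typing import Optional, Tuple

TRIPLET_RE = re.compile(
    r"instrument:\s*(?P<instrument>[\w\- ]+)[,;]\s*"
    r"(?:verb|action):\s*(?P<verb>[\w\- ]+)[,;]\s*"
    r"target:\s*(?P<target>[\w\- ]+)",
    re.IGNORECASE,
)

def parse_triplet(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (instrument, verb, target) if the pattern matches, else None."""
    m = TRIPLET_RE.search(text)
    if m is None:
        return None
    return tuple(s.strip().lower() for s in m.group("instrument", "verb", "target"))

# Example with a made-up model response:
print(parse_triplet("Instrument: grasper; Verb: retract; Target: gallbladder"))
# -> ('grasper', 'retract', 'gallbladder')
```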
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s): Any standard LLM-serving solution (e.g., PyTorch with Triton Inference Server); a minimal loading sketch appears at the end of this section.
**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere (e.g., A100)
- NVIDIA Hopper (e.g., H100)
Preferred/Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
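As one concrete runtime path, the sketch below loads the checkpoint with the Hugging Face transformers Qwen2.5-VL classes (the model is based on Qwen2.5-VL-7B, so the upstream interface is assumed to apply) and runs single-frame inference; the frame path and prompt wording are illustrative assumptions.

```python
# Minimal PyTorch/transformers inference sketch (frame path and prompt are illustrative).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "nvidia/Qwen2.5-VL-7B-Surg-CholecT50"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "frames/frame_000123.png"},  # hypothetical frame
        {"type": "text", "text": "Describe the instrument, action, and target in this frame."},
    ],
}]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the generated answer is decoded.
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```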
Model Version(s):
v1.0 (Finetuned on CholecT50)
This model may be used with the MONAI Surgical Agent Framework.
Training Dataset:
**Data Modality:**
- Image and Text
**Image Training Data Size:**
- Less than a Million Images
**Text Training Data Size:**
- Less than a Billion Tokens
**Data Collection Method by dataset:**
- Hybrid: Automated, Human
**Labeling Method by dataset:**
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): 50 laparoscopic cholecystectomy procedures; frames extracted at 1 fps (100K training frames); annotations include <instrument, verb, target> triplets.
Testing Dataset:
Link: CholecT50 (holdout portion)
Data Collection Method by dataset:
- Hybrid: Automated, Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): ~1–2K frames for testing.
Evaluation Dataset:
Link: CholecT50 (dedicated set never seen during training)
**Benchmark Score:**
- F1-score (Triplets): Instrument: 0.81, Verb: 0.64, Target (Anatomy): 0.60 (see the evaluation sketch at the end of this section)
Data Collection Method by dataset:
- Hybrid: Automated, Human
Labeling Method by dataset:
- Human
Properties (Quantity, Dataset Descriptions, Sensor(s)): ~1–2K frames for final evaluation.
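For context on the scores above, per-component F1 over parsed triplets can be computed as in the sketch below; the label lists are made-up examples, and macro averaging is an assumption, since this card does not state the averaging scheme used.

```python
# Hypothetical per-component F1 computation over parsed triplets.
# Labels are made-up; macro averaging is an assumption (the card does not
# state how the reported F1 scores were averaged).
from sklearn.metrics import f1_score

gt_instruments   = ["grasper", "hook", "grasper", "clipper"]
pred_instruments = ["grasper", "hook", "hook",    "clipper"]

gt_verbs   = ["retract", "dissect", "retract", "clip"]
pred_verbs = ["retract", "dissect", "grasp",   "clip"]

gt_targets   = ["gallbladder", "cystic_duct", "gallbladder", "cystic_artery"]
pred_targets = ["gallbladder", "cystic_duct", "liver",       "cystic_artery"]

for name, gt, pred in [
    ("Instrument", gt_instruments, pred_instruments),
    ("Verb",       gt_verbs,       pred_verbs),
    ("Target",     gt_targets,     pred_targets),
]:
    print(f"{name}: F1 = {f1_score(gt, pred, average='macro'):.2f}")
```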
Inference:
Acceleration Engine: vLLM
Test Hardware: A6000
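A minimal client-side sketch against a vLLM OpenAI-compatible endpoint is shown below; the server command in the comment, the endpoint URL, the frame path, and the prompt wording are illustrative assumptions.

```python
# Client sketch assuming an OpenAI-compatible vLLM server, e.g. started with:
#   vllm serve nvidia/Qwen2.5-VL-7B-Surg-CholecT50
# Endpoint URL, frame path, and prompt wording are illustrative assumptions.
import base64
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("frames/frame_000123.png", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="nvidia/Qwen2.5-VL-7B-Surg-CholecT50",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
            {"type": "text",
             "text": "Describe the instrument, action, and target in this frame."},
        ],
    }],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```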
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, note that the model will not blur or otherwise de-identify the subjects it depicts. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.