---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/UbiquantAI/Fleming-R1-32B/blob/main/LICENSE
pipeline_tag: text-generation
---

# Fleming-VL-8B

<p align="center" style="margin: 0;">
  <a href="https://github.com/UbiquantAI/Fleming-R1" aria-label="GitHub Repository" style="text-decoration:none;">
    <span style="display:inline-flex;align-items:center;gap:.35em;">
      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"
           width="16" height="16" aria-hidden="true"
           style="vertical-align:text-bottom;fill:currentColor;">
        <path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8Z"/>
      </svg>
      <span>GitHub</span>
    </span>
  </a>
  <span style="margin:0 .75em;opacity:.6;">•</span>
  <a href="https://arxiv.org/abs/2509.15279" aria-label="Paper">📑&nbsp;Paper</a>
</p>

## 📖 Model Overview

Fleming-VL is a multimodal reasoning model for medical scenarios that can process and analyze various types of medical data, including 2D images, 3D volumetric data, and video sequences. It performs step-by-step analysis of complex multimodal medical problems and produces reliable answers. Building on the GRPO reasoning paradigm, Fleming-VL extends these capabilities to diverse medical imaging modalities while maintaining strong reasoning performance.

**Model Features:**

* **Multimodal Processing:** Supports a range of medical data types, including 2D images (X-rays, pathology slides), 3D volumes (CT/MRI scans), and videos (ultrasound, endoscopy, surgical recordings).
* **Medical Reasoning:** Performs step-by-step chain-of-thought reasoning on complex medical problems, combining visual information with medical knowledge to provide reliable diagnostic insights.

## 📦 Releases

- **Fleming-VL-8B**: trained on InternVL3-8B
  🤗 [`UbiquantAI/Fleming-VL-8B`](https://huggingface.co/UbiquantAI/Fleming-VL-8B)
- **Fleming-VL-38B**: trained on InternVL3-38B
  🤗 [`UbiquantAI/Fleming-VL-38B`](https://huggingface.co/UbiquantAI/Fleming-VL-38B)

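The checkpoints above are standard 🤗 Hub repositories. If you want a local copy of the weights before running the quick-start script below (which copies a few auxiliary files into the model directory and therefore expects a local folder), a minimal, optional sketch using `huggingface_hub` follows; the `local_dir` value is only an illustrative path, not a project convention.

```python
# Optional: fetch a checkpoint into a local folder first.
# The local_dir below is an illustrative example path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="UbiquantAI/Fleming-VL-8B",  # or "UbiquantAI/Fleming-VL-38B"
    local_dir="./Fleming-VL-8B",
)
print(f"Checkpoint files are in: {local_dir}")
```
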
## 📊 Performance

### Main Benchmark Results

<div align="center">
  <img src="images/exp_result.png" alt="Benchmark Results" width="60%">
</div>


## 🔧 Quick Start

```python
"""
Fleming-VL-8B Multi-Modal Inference Script

This script demonstrates three inference modes:
1. Single image inference
2. Video inference (frame-by-frame)
3. 3D medical image (CT/MRI) inference from .npy files

Model: UbiquantAI/Fleming-VL-8B
Based on: InternVL_chat-1.2 template
"""

from transformers import AutoTokenizer, AutoModel, CLIPImageProcessor
from decord import VideoReader, cpu
from PIL import Image
import numpy as np
import shutil
import torch
import os


# ============================================================================
# Configuration
# ============================================================================

MODEL_PATH = "UbiquantAI/Fleming-VL-8B"
REQUIRED_FILES_DIR = './required_files'

# Prompt template for reasoning-based responses
REASONING_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user a concise "
    "final answer in a short word or phrase. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> "
    "tags, respectively, i.e., <think> reasoning process here </think>"
    "<answer> answer here </answer>"
)


# ============================================================================
# Utility Functions
# ============================================================================

def copy_necessary_files(target_path, source_path):
    """
    Copy required model configuration files to the model directory.

    Args:
        target_path: Destination directory (model path)
        source_path: Source directory containing required files
    """
    required_files = [
        "modeling_internvl_chat.py",
        "conversation.py",
        "modeling_intern_vit.py",
        "preprocessor_config.json",
        "configuration_internvl_chat.py",
        "configuration_intern_vit.py",
    ]

    for filename in required_files:
        target_file = os.path.join(target_path, filename)
        source_file = os.path.join(source_path, filename)

        if not os.path.exists(target_file):
            print(f"File {filename} not found in target path, copying from source...")

            if os.path.exists(source_file):
                try:
                    shutil.copy2(source_file, target_file)
                    print(f"Successfully copied {filename}")
                except Exception as e:
                    print(f"Error copying {filename}: {str(e)}")
            else:
                print(f"Warning: Source file {filename} does not exist, cannot copy")
        else:
            print(f"File {filename} already exists")


def load_model(model_path, use_flash_attn=True):
    """
    Load the vision-language model and tokenizer.

    Args:
        model_path: Path to the pretrained model
        use_flash_attn: Whether to use flash attention (default: True)

    Returns:
        tuple: (model, tokenizer)
    """
    model = AutoModel.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        use_flash_attn=use_flash_attn,
        trust_remote_code=True
    ).eval().cuda()

    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        use_fast=False
    )

    return model, tokenizer


# ============================================================================
# Image Inference
# ============================================================================

def inference_single_image(model, tokenizer, image_path, question, prompt=REASONING_PROMPT):
    """
    Perform inference on a single image.

    Args:
        model: Loaded vision-language model
        tokenizer: Loaded tokenizer
        image_path: Path to the input image
        question: Question to ask about the image
        prompt: System prompt template

    Returns:
        str: Model response
    """
    # Load and preprocess image
    image_processor = CLIPImageProcessor.from_pretrained(MODEL_PATH)
    image = Image.open(image_path).resize((448, 448))
    pixel_values = image_processor(
        images=image,
        return_tensors='pt'
    ).pixel_values.to(torch.bfloat16).cuda()

    # Prepare question with prompt and image token
    full_question = f"{prompt}\n<image>\n{question}"

    # Generate response
    generation_config = dict(max_new_tokens=1024, do_sample=False)
    response = model.chat(tokenizer, pixel_values, full_question, generation_config)

    return response


# ============================================================================
# Video Inference
# ============================================================================

def get_frame_indices(bound, fps, max_frame, first_idx=0, num_segments=32):
    """
    Calculate evenly distributed frame indices for video sampling.

    Args:
        bound: Tuple of (start_time, end_time) in seconds, or None for full video
        fps: Frames per second of the video
        max_frame: Maximum frame index
        first_idx: First frame index to consider
        num_segments: Number of frames to sample

    Returns:
        np.array: Array of frame indices
    """
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000

    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments

    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])

    return frame_indices


def load_video(video_path, model_path, bound=None, num_segments=32):
    """
    Load and preprocess video frames.

    Args:
        video_path: Path to the video file
        model_path: Path to the model (for image processor)
        bound: Time boundary tuple (start, end) in seconds
        num_segments: Number of frames to extract

    Returns:
        tuple: (pixel_values tensor, list of num_patches per frame)
    """
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list = []
    num_patches_list = []
    image_processor = CLIPImageProcessor.from_pretrained(model_path)

    frame_indices = get_frame_indices(bound, fps, max_frame, first_idx=0, num_segments=num_segments)

    for frame_index in frame_indices:
        # Extract and preprocess frame
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB').resize((448, 448))
        pixel_values = image_processor(images=img, return_tensors='pt').pixel_values
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)

    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list


def inference_video(model, tokenizer, video_path, video_duration, question, prompt=REASONING_PROMPT):
    """
    Perform inference on a video by sampling frames.

    Args:
        model: Loaded vision-language model
        tokenizer: Loaded tokenizer
        video_path: Path to the video file
        video_duration: Duration of video in seconds
        question: Question to ask about the video
        prompt: System prompt template

    Returns:
        str: Model response
    """
    # Sample frames from video (1 frame per second)
    num_segments = int(video_duration)
    pixel_values, num_patches_list = load_video(video_path, MODEL_PATH, num_segments=num_segments)
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # Create image token prefix for all frames
    video_prefix = ''.join(['<image>\n' for _ in range(len(num_patches_list))])

    # Prepare question with prompt and image tokens
    full_question = f"{prompt}\n{video_prefix}{question}"

    # Generate response
    generation_config = dict(max_new_tokens=1024, do_sample=False)
    response, history = model.chat(
        tokenizer,
        pixel_values,
        full_question,
        generation_config,
        num_patches_list=num_patches_list,
        history=None,
        return_history=True
    )

    return response


# ============================================================================
# 3D Medical Image (NPY) Inference
# ============================================================================

def normalize_image(image):
    """
    Normalize image array to 0-255 range.

    Args:
        image: NumPy array of image data

    Returns:
        np.array: Normalized image as uint8
    """
    img_min = np.min(image)
    img_max = np.max(image)

    if img_max - img_min == 0:
        return np.zeros_like(image, dtype=np.uint8)

    return ((image - img_min) / (img_max - img_min) * 255).astype(np.uint8)


def convert_npy_to_images(npy_path, model_path, num_slices=11):
    """
    Convert a 3D medical image (.npy) to multiple 2D RGB images.

    Expected input shape: (32, 256, 256) or (1, 32, 256, 256)
    Extracts evenly distributed slices and converts them to RGB format.

    Args:
        npy_path: Path to the .npy file
        model_path: Path to the model (for image processor)
        num_slices: Number of slices to extract (default: 11)

    Returns:
        tuple: (pixel_values tensor, list of num_patches per slice) or False if error
    """
    try:
        # Load .npy file
        data = np.load(npy_path)

        # Handle shape (1, 32, 256, 256) -> (32, 256, 256)
        if data.ndim == 4 and data.shape[0] == 1:
            data = data[0]

        # Validate shape
        if data.shape != (32, 256, 256):
            print(f"Warning: {npy_path} has shape {data.shape}, expected (32, 256, 256), skipping")
            return False

        # Select evenly distributed slices from the 32 slices
        indices = np.linspace(0, 31, num_slices, dtype=int)

        image_processor = CLIPImageProcessor.from_pretrained(model_path)
        pixel_values_list = []
        num_patches_list = []

        # Process each selected slice
        for idx in indices:
            # Get slice
            slice_img = data[idx]

            # Normalize to 0-255
            normalized = normalize_image(slice_img)

            # Convert grayscale to RGB by stacking
            rgb_img = np.stack([normalized, normalized, normalized], axis=-1)

            # Convert to PIL Image
            img = Image.fromarray(rgb_img)

            # Preprocess with CLIP processor
            pixel_values = image_processor(images=img, return_tensors='pt').pixel_values
            num_patches_list.append(pixel_values.shape[0])
            pixel_values_list.append(pixel_values)

        pixel_values = torch.cat(pixel_values_list)
        return pixel_values, num_patches_list

    except Exception as e:
        print(f"Error processing {npy_path}: {str(e)}")
        return False


def inference_3d_medical_image(model, tokenizer, npy_path, question, prompt=REASONING_PROMPT):
    """
    Perform inference on 3D medical images stored as .npy files.

    Args:
        model: Loaded vision-language model
        tokenizer: Loaded tokenizer
        npy_path: Path to the .npy file (shape: 32x256x256)
        question: Question to ask about the image
        prompt: System prompt template

    Returns:
        str: Model response or None if error
    """
    # Convert 3D volume to multiple 2D slices
    result = convert_npy_to_images(npy_path, MODEL_PATH)

    if result is False:
        return None

    pixel_values, num_patches_list = result
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # Create image token prefix for all slices
    image_prefix = ''.join(['<image>\n' for _ in range(len(num_patches_list))])

    # Prepare question with prompt and image tokens
    full_question = f"{prompt}\n{image_prefix}{question}"

    # Generate response
    generation_config = dict(max_new_tokens=1024, do_sample=False)
    response, history = model.chat(
        tokenizer,
        pixel_values,
        full_question,
        generation_config,
        num_patches_list=num_patches_list,
        history=None,
        return_history=True
    )

    return response


# ============================================================================
# Main Execution Examples
# ============================================================================

def main():
    """
    Main function demonstrating all three inference modes.
    """
    # Copy necessary files
    copy_necessary_files(MODEL_PATH, REQUIRED_FILES_DIR)

    # ========================================================================
    # Example 1: Single Image Inference
    # ========================================================================
    print("\n" + "="*80)
    print("EXAMPLE 1: Single Image Inference")
    print("="*80)

    image_path = "./test.png"
    question = (
        "What imaging technique was employed to obtain this picture?\n"
        "A. PET scan. B. CT scan. C. Blood test. D. Fundus imaging."
    )

    model, tokenizer = load_model(MODEL_PATH, use_flash_attn=True)
    response = inference_single_image(model, tokenizer, image_path, question)

    print(f"\nUser: {question}")
    print(f"Assistant: {response}")

    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()

    # ========================================================================
    # Example 2: Video Inference
    # ========================================================================
    print("\n" + "="*80)
    print("EXAMPLE 2: Video Inference")
    print("="*80)

    video_path = "./test.mp4"
    video_duration = 6  # seconds
    question = "Please describe the video."

    model, tokenizer = load_model(MODEL_PATH, use_flash_attn=False)
    response = inference_video(model, tokenizer, video_path, video_duration, question)

    print(f"\nUser: {question}")
    print(f"Assistant: {response}")

    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()

    # ========================================================================
    # Example 3: 3D Medical Image Inference
    # ========================================================================
    print("\n" + "="*80)
    print("EXAMPLE 3: 3D Medical Image Inference")
    print("="*80)

    npy_path = "./test.npy"
    question = "What device is observed on the chest wall?"

    # Example cases:
    # Case 1: /path/to/test_1016_d_2.npy
    #   Question: "Where is the largest lymph node observed?"
    #   Answer: "Right hilar region."
    #
    # Case 2: /path/to/test_1031_a_2.npy
    #   Question: "What device is observed on the chest wall?"
    #   Answer: "Pacemaker."

    model, tokenizer = load_model(MODEL_PATH, use_flash_attn=False)
    response = inference_3d_medical_image(model, tokenizer, npy_path, question)

    if response:
        print(f"\nUser: {question}")
        print(f"Assistant: {response}")
    else:
        print("\nError: Failed to process 3D medical image")

    # Clean up GPU memory
    del model, tokenizer
    torch.cuda.empty_cache()


if __name__ == "__main__":
    main()
```
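
The reasoning prompt in the script asks the model to wrap its chain of thought in `<think> </think>` tags and the final answer in `<answer> </answer>` tags, so the raw `response` string typically contains both. The helper below is a small illustrative sketch for separating them, not part of the released script; it assumes the tags appear exactly as requested by `REASONING_PROMPT` and falls back to the full response when they are missing.

```python
import re


def split_reasoning_response(response: str):
    """Split a Fleming-VL response into (reasoning, answer).

    Assumes the <think>...</think><answer>...</answer> format requested by
    REASONING_PROMPT; returns (None, full response) if the tags are absent.
    """
    think_match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)

    reasoning = think_match.group(1).strip() if think_match else None
    answer = answer_match.group(1).strip() if answer_match else response.strip()
    return reasoning, answer


# Example usage with a response returned by model.chat(...):
# reasoning, answer = split_reasoning_response(response)
# print("Reasoning:", reasoning)
# print("Answer:", answer)
```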

## ⚠️ Safety Statement

This project is for research and non-clinical reference only; it must not be used for actual diagnosis or treatment decisions.
The generated reasoning traces are an auditable intermediate process and do not constitute medical advice.
In medical scenarios, results must be reviewed and approved by qualified professionals, and all applicable laws, regulations, and privacy compliance requirements in your region must be followed.

## 📚 Citation

```bibtex
@misc{flemingr1,
  title={Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning},
  author={Chi Liu and Derek Li and Yan Shu and Robin Chen and Derek Duan and Teng Fang and Bryan Dai},
  year={2025},
  eprint={2509.15279},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.15279},
}
```